Article

Offline evaluation options for recommender systems

Rocío Cañamares, Pablo Castells, and Alistair Moffat.
Information Retrieval Journal, 23 (4): 387--410 (Aug 1, 2020)
DOI: 10.1007/s10791-020-09371-3

Abstract

We undertake a detailed examination of the steps that make up offline experiments for recommender system evaluation, including the manner in which the available ratings are filtered and split into training and test; the selection of a subset of the available users for the evaluation; the choice of strategy to handle the background effects that arise when the system is unable to provide scores for some items or users; the use of either full or condensed output lists for the purposes of scoring; scoring methods themselves, including alternative top-weighted mechanisms for condensed rankings; and the application of statistical testing on a weighted-by-user or weighted-by-volume basis as a mechanism for providing confidence in measured outcomes. We carry out experiments that illustrate the impact that each of these choice points can have on the usefulness of an end-to-end system evaluation, and provide examples of possible pitfalls. In particular, we show that varying the split between training and test data, or changing the evaluation metric, or how target items are selected, or how empty recommendations are dealt with, can give rise to comparisons that are vulnerable to misinterpretation, and may lead to different or even opposite outcomes, depending on the exact combination of settings used.
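Two of the choice points listed in the abstract, the way ratings are split into training and test, and the use of a top-weighted scoring mechanism, are easier to see in code. The sketch below is not the authors' implementation; it is a minimal illustration assuming a leave-one-out split policy, a toy popularity-baseline recommender, and rank-biased precision with an assumed persistence parameter p as one example of a top-weighted metric. The sample ratings, function names, and seed are all hypothetical.

    # Illustrative sketch only: a leave-one-out train/test split and a
    # top-weighted (rank-biased precision) score for a per-user ranking.
    import random
    from collections import defaultdict

    def leave_one_out_split(ratings, seed=0):
        """Hold out one rated item per user as the test target; keep the rest for training."""
        rng = random.Random(seed)
        by_user = defaultdict(list)
        for user, item, value in ratings:
            by_user[user].append((item, value))
        train, test = [], {}
        for user, items in by_user.items():
            held_out_item = rng.choice(items)[0]
            test[user] = held_out_item
            train.extend((user, i, v) for i, v in items if i != held_out_item)
        return train, test

    def rank_biased_precision(ranking, relevant, p=0.8):
        """Top-weighted score: rank position i (0-based) contributes (1 - p) * p**i if relevant."""
        return sum((1 - p) * p**i for i, item in enumerate(ranking) if item in relevant)

    if __name__ == "__main__":
        # Hypothetical (user, item, rating) triples.
        ratings = [("u1", "a", 5), ("u1", "b", 3), ("u1", "c", 4),
                   ("u2", "a", 2), ("u2", "d", 5), ("u2", "e", 4)]
        train, test = leave_one_out_split(ratings)

        # Toy "recommender": rank unseen items by training-set popularity.
        popularity = defaultdict(int)
        for _, item, _ in train:
            popularity[item] += 1
        for user, target in test.items():
            seen = {i for u, i, _ in train if u == user}
            ranking = sorted((i for i in popularity if i not in seen),
                             key=lambda i: -popularity[i])
            print(user, round(rank_biased_precision(ranking, {target}), 4))

In this sketch the persistence parameter p controls how strongly the score concentrates on the top of the list, which is the kind of setting the abstract warns can change the outcome of a comparison between systems.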
