Inproceedings

Controlling Bias in Adaptive Data Analysis Using Information Theory

Daniel Russo and James Zou.
Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 1232--1240, Cadiz, Spain. PMLR, 09--11 May 2016.

Abstract

Modern big data settings often involve messy, high-dimensional data, where it is not clear a priori what are the right questions to ask. To extract the most insights from a dataset, the analyst typically needs to engage in an iterative process of adaptive data analysis. The choice of analytics to be performed next depends on the results of the previous analyses on the same data. It is commonly recognized that such adaptivity (also called researcher degrees of freedom), even if well-intentioned, can lead to false discoveries, contributing to the crisis of reproducibility in science. In this paper, we propose a general information-theoretic framework to quantify and provably bound the bias of an arbitrary adaptive analysis process. We prove that our mutual information based bound is tight in natural models. We show how this framework can give rigorous insights into when commonly used feature selection protocols (e.g. rank selection) do and do not lead to biased estimation. We also show how recent insights from differential privacy emerge from this framework when the analyst is assumed to be adversarial, though our bounds apply in more general settings. We illustrate our results with simple simulations.
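The "mutual information based bound" the abstract refers to is the paper's main result. As a hedged reconstruction (the precise statement and constants are in the paper, not in this entry): if each candidate statistic φ_t(X) is σ-sub-Gaussian about its mean μ_t = E[φ_t(X)], and the analyst picks an index T after seeing the data X, then the selection bias is controlled by the mutual information between the choice and the data:

```latex
% Sketch of the bias bound, assuming each \phi_t(X) - \mu_t is \sigma-sub-Gaussian
% and the index T is chosen (possibly at random) after observing X:
\bigl| \mathbb{E}\bigl[ \phi_T(X) - \mu_T \bigr] \bigr|
  \;\le\; \sigma \sqrt{2\, I(T; X)}
```

The abstract also mentions rank selection and simple simulations. Below is a minimal sketch of such a simulation, not code from the paper: it reports the largest of m sample means computed from pure-noise data (so every true mean is zero and any positive estimate is pure selection bias) and compares the empirical bias to the bound above under the crude relaxation I(T; X) ≤ H(T) ≤ log m. The values of m, n, and trials are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n, trials = 50, 100, 2000   # candidate statistics, samples per statistic, Monte Carlo runs
sigma = 1.0 / np.sqrt(n)       # the mean of n N(0,1) samples is (1/sqrt(n))-sub-Gaussian

# Rank selection: always report the largest of m sample means.
# All true means are 0, so the reported value is pure selection bias.
best = np.empty(trials)
for i in range(trials):
    sample_means = rng.standard_normal((m, n)).mean(axis=1)
    best[i] = sample_means.max()

# For a deterministic selection among m options, I(T; X) <= log m (in nats),
# so the sketched bound specializes to sigma * sqrt(2 log m).
mi_bound = sigma * np.sqrt(2 * np.log(m))
print(f"empirical bias of the selected mean: {best.mean():.3f}")
print(f"mutual-information bound:            {mi_bound:.3f}")
```

With these settings the empirical bias should land around 0.22 (the expected maximum of 50 standard normals, scaled by 1/sqrt(100)), a bit below the roughly 0.28 bound, which is in line with the abstract's claim that the bound is tight in natural models.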
