12-04-2014, 10:01 AM
yaser
Re: Snooping and unsupervised preprocessing

Strictly speaking, one can construct an unsupervised scheme (model selection based on data inputs only, without involving the output labels) that can ruin the chances for good generalization in subsequent use of supervised learning. One example, admittedly extreme, is to take ${\bf x}_1,{\bf x}_2,\dots,{\bf x}_N$ and select a model that is free to choose the values of the output on this specific set of points, but forces an output of zero outside that set. This model can then achieve $E_{\rm in}=0$ in supervised learning by matching $y_1,y_2,\dots,y_N$, but will obviously perform poorly out of sample since it is forced to output 0 on any other point.
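
To make the extreme example concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the one-dimensional inputs, the sign target, the sample sizes, and the helper name make_memorizing_model. The point is only that the model is selected by looking at the inputs alone, and the labels are used only afterwards in the supervised step.

```python
import random

def make_memorizing_model(train_inputs):
    """Hypothetical helper: builds a model from the INPUTS only.

    The model may output anything on the given points, but is
    forced to output 0 on every other point.
    """
    table = {x: 0 for x in train_inputs}  # free parameters, one per training input

    def fit(labels):
        # Supervised step: match the labels on the memorized points.
        for x, y in zip(train_inputs, labels):
            table[x] = y

    def predict(x):
        # Any point outside the memorized set gets the forced output 0.
        return table.get(x, 0)

    return fit, predict

def target(x):
    # Assumed target function, chosen only for this illustration.
    return 1 if x >= 0 else -1

rng = random.Random(0)
train_x = [rng.uniform(-1, 1) for _ in range(50)]
train_y = [target(x) for x in train_x]
test_x = [rng.uniform(-1, 1) for _ in range(1000)]
test_y = [target(x) for x in test_x]

fit, predict = make_memorizing_model(train_x)  # model selection: inputs only
fit(train_y)                                   # supervised learning: uses labels

# Classification error in sample and on fresh points.
e_in = sum(predict(x) != y for x, y in zip(train_x, train_y)) / len(train_x)
e_out = sum(predict(x) != y for x, y in zip(test_x, test_y)) / len(test_x)
print("E_in =", e_in, " E_test =", e_out)
```

The in-sample error is zero by construction, while the forced output of 0 never agrees with a label of +1 or -1 on a fresh point, so the test error stays near its maximum.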

Practically speaking, 'reasonable' processing and model selection based on the data inputs doesn't normally contaminate the data much for subsequent supervised learning, so it is not unreasonable to state a heuristic rule that unsupervised processing of the data does not constitute data snooping.

In the financial data experiment given in Example 5.3 (also in Lecture 17 of the LFD course), the normalization step that constituted data snooping involved the outputs as well, since it was done on the daily returns before structuring the data into inputs and outputs, and the outputs were indeed among the daily returns.
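
The difference between the two orders of operation can be sketched as follows. This is not the experiment from the book: the returns are synthetic Gaussian numbers, and the split point, the window length of 20 days, and the helper names normalize and windows are assumptions made only for illustration.

```python
import random
import statistics

rng = random.Random(1)
# Synthetic stand-in for daily returns (NOT real financial data).
returns = [rng.gauss(0, 1) for _ in range(500)]

split = 400  # assumed: first 400 days for training, the rest for testing

# Snooped statistics: computed over ALL daily returns, including the
# days that will later serve as test outputs.
mu_all = statistics.fmean(returns)
sd_all = statistics.pstdev(returns)

# Clean statistics: computed from the training days only.
mu_tr = statistics.fmean(returns[:split])
sd_tr = statistics.pstdev(returns[:split])

def normalize(series, mu, sd):
    # Shift and scale every return with the given statistics.
    return [(r - mu) / sd for r in series]

def windows(series, k=20):
    # Structure a series into (inputs, output) pairs:
    # the previous k returns are the inputs, the next return is the output.
    return [(series[i:i + k], series[i + k]) for i in range(len(series) - k)]

# Clean pipeline: the test days are normalized with training statistics,
# so no test output has influenced the preprocessing.
clean = normalize(returns, mu_tr, sd_tr)
train = windows(clean[:split])
test = windows(clean[split:])

print("mean over all days:", mu_all, " mean over training days:", mu_tr)
print(len(train), "training examples,", len(test), "test examples")
```

Using mu_all and sd_all in place of mu_tr and sd_tr would reproduce the snooped ordering, since those statistics depend on returns that end up as test outputs.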
Where everyone thinks alike, no one thinks very much