11-03-2014, 11:14 PM
daniel0
Re: Data Snooping with Test Set Inputs Intuition

I just got home, so I was able to read through some of Chapter 5 on data snooping. It seems that the exchange rate prediction problem referenced there is particularly vulnerable to it. I can't express this formally at the moment, but it seems like labels from the test set are making their way into the training set, since the input data contains values that exactly match labels (that is, given the way the data set is constructed, the label from observation i is part of the input data of observation i+1). I would be interested in the results if the same experiment were run with a much sparser data set, such that any given rate change shows up in only one row of data.
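To make the overlap concrete, here is a minimal sketch of how I picture the data set being built. The function names, the window length, and the numbers are all made up for illustration; this is my assumption about the construction, not the book's actual code.

```python
def make_windows(changes, n_lags):
    """Each row: n_lags consecutive rate changes as input, the next change as label."""
    rows = []
    for t in range(n_lags, len(changes)):
        rows.append((changes[t - n_lags:t], changes[t]))
    return rows


def make_sparse_windows(changes, n_lags):
    """Non-overlapping rows, so any given rate change appears in only one row."""
    step = n_lags + 1
    return [
        (changes[s:s + n_lags], changes[s + n_lags])
        for s in range(0, len(changes) - n_lags, step)
    ]


# Hypothetical daily rate changes (illustrative values only).
changes = [0.3, -0.1, 0.4, -0.2, 0.1, 0.5, -0.3, 0.2]

dense_rows = make_windows(changes, n_lags=3)
sparse_rows = make_sparse_windows(changes, n_lags=3)

# In the dense construction, the label of row i is the last input of row i+1.
overlaps = [
    dense_rows[i + 1][0][-1] == dense_rows[i][1]
    for i in range(len(dense_rows) - 1)
]
print(all(overlaps))
print(len(dense_rows), len(sparse_rows))
```

With the dense construction, a row that lands in the test set can have its label sitting in the inputs of a training row, which is the leak I was trying to describe. The sparse version avoids that at the cost of far fewer rows.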

So I suppose there may be cases where incorporating test input data (not labels, just the raw unsupervised input) is benign (like the example I gave in earlier posts), but it could have consequences in non-obvious ways.

Regarding dimensionality reduction, I've seen references to both negative and benign consequences. I have not run any experiments myself. It sounds like it could have non-obvious consequences (similar to the consequences, discussed in Lecture 17, of using test data to compute normalization parameters).
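For the normalization case, here is a small sketch of what I mean by test inputs affecting training. The data and helper names are invented for illustration, and it only uses the standard library; I would expect the same pattern to apply to fitting PCA, but I have not tested that.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Return the (mean, standard deviation) used to normalize a feature."""
    return mean(values), pstdev(values)


def transform(values, mu, sigma):
    """Normalize values with previously fitted parameters."""
    return [(v - mu) / sigma for v in values]


# Hypothetical one-dimensional inputs (no labels involved anywhere).
train_x = [1.0, 2.0, 3.0, 4.0]
test_x = [10.0, 12.0]

# Clean: parameters come from the training inputs only.
mu_clean, sigma_clean = fit_scaler(train_x)
train_clean = transform(train_x, mu_clean, sigma_clean)

# Snooped: parameters come from training and test inputs together.
mu_snoop, sigma_snoop = fit_scaler(train_x + test_x)
train_snoop = transform(train_x, mu_snoop, sigma_snoop)

# The training inputs the model sees differ between the two cases,
# so the test inputs have influenced training even without their labels.
print(train_clean != train_snoop)
```

Whether that difference actually hurts the test estimate presumably depends on the problem, which is why I'd like to see it formalized.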

Here's the example where someone references a problem:
"I detected only half of the generalization error rate when not redoing the PCA for every surrogate model"

Here's an example where someone had no problem:

As before, any insight would be greatly appreciated, especially if any of these ideas have been formalized elsewhere.

As I mentioned earlier, it's more obvious to me why validation performance may be inflated if labels from the test data were known (snooped) at the time of training. The following video provides an example and explanation:
