LFD Book Forum (http://book.caltech.edu/bookforum/index.php)
-   Chapter 5 - Three Learning Principles (http://book.caltech.edu/bookforum/forumdisplay.php?f=112)
-   -   Data Snooping with Test Set Inputs Intuition (http://book.caltech.edu/bookforum/showthread.php?t=4535)

 daniel0 11-03-2014 07:12 PM

Data Snooping with Test Set Inputs Intuition

Lecture 17 gives an example where test data is used to calculate means for pre-processing training data. It is indicated that doing so will bias the results such that the performance will be inflated when the model is tested on the test set.

It makes sense to me that test data should not be used at all for learning parameters of a model, including parameters for pre-processing. After all, when a model is used in production, the pre-processing parameters have to already exist, and can't be a function of online data.

However, I am having a difficult time understanding the intuition behind the example from Lecture 17. Why does using test data to calculate means for normalizing the data improve the performance when testing the model? It is more clear to me why the test scores would be inflated if, say, the test labels were somehow incorporated into the training process (for example, by doing feature selection prior to splitting the data).

Thanks,
Dan

 daniel0 11-03-2014 08:19 PM

Re: Data Snooping with Test Set Inputs Intuition

I can think of a hyperbolic example where having access to test inputs could bias a trained model to perform well on the test data. For example, when training on the training set, observations that are near points from the test data could be given extra weight, to ensure the model learns to do well on the test data.

Any intuition for the original example would still be appreciated though. That is, some intuitive reason for an accuracy bias (a positive accuracy bias in the Lecture 17 example) when normalizing the training data using means and variances that were calculated from both training and test data. As the data set size grows, the issue should decrease in severity, since the means and variances of the test data and training data probabilistically become closer as n grows.
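To make the worry concrete, here is a toy numpy sketch I put together (my own construction, not from the lecture). With a scale-sensitive model like 1-nearest-neighbor, the standardization parameters computed from train+test inputs differ from the train-only ones, so the final hypothesis genuinely depends on the test inputs:

```python
import numpy as np

rng = np.random.default_rng(4)
X_tr = rng.normal(size=(30, 2))
y_tr = (X_tr[:, 0] > 0).astype(int)
X_te = rng.normal(size=(10, 2))
X_te[:, 0] += 3.0          # test inputs shifted in the first feature

def one_nn_predict(mu, sd):
    # 1-NN in the standardized coordinates defined by (mu, sd).
    Z_tr, Z_te = (X_tr - mu) / sd, (X_te - mu) / sd
    d2 = ((Z_te[:, None, :] - Z_tr[None, :, :]) ** 2).sum(axis=2)
    return y_tr[d2.argmin(axis=1)]

# Train-only parameters vs. parameters snooped from train+test inputs.
mu_c, sd_c = X_tr.mean(axis=0), X_tr.std(axis=0)
X_all = np.vstack([X_tr, X_te])
mu_s, sd_s = X_all.mean(axis=0), X_all.std(axis=0)

# The learned pre-processing now depends on the test inputs:
print(sd_c, sd_s)                  # the feature scales differ
print(one_nn_predict(mu_c, sd_c))  # so the predictions can differ too
print(one_nn_predict(mu_s, sd_s))
```

Because distances weight the features by 1/sd, changing sd reweights the features, which can flip which training point is nearest.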

Like I mentioned, it is more clear to me why this is problematic if labels from the test set were used during the training process.

I have heard the same warning given regarding dimensionality reduction (I am not referring to feature selection, where test data labels are used, and I intuitively understand the consequences). In such case, the warning is the same: when doing PCA (or some other unsupervised dimensionality reduction), do the pre-processing just on the training data and use the parameters to reduce dimensions of test data during evaluation. I also have a hard time intuitively seeing why this would bias results one way or the other.
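For concreteness, here is how I understand the recommended discipline in code (a toy numpy sketch I wrote, using SVD for the PCA step): fit the reduction on the training inputs only, then apply those frozen parameters to the test inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
X_test = rng.normal(size=(40, 5))

# Fit the pre-processing on the training inputs ONLY.
mu = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
components = Vt[:2]                      # top-2 principal directions

# Apply the SAME frozen parameters (mu, components) to both sets.
Z_train = (X_train - mu) @ components.T
Z_test = (X_test - mu) @ components.T    # no refitting on the test data

print(Z_train.shape, Z_test.shape)       # (100, 2) (40, 2)
```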

 daniel0 11-03-2014 10:05 PM

Re: Data Snooping with Test Set Inputs Intuition

3 Attachment(s)
I'm still trying to wrap my head around this. I tried the following experiment using the cpu dataset in Weka. Using the dataset, I created 2 additional datasets:
1) a dataset that used all data in the original dataset to standardize the features (zero mean and unit variance)
2) a dataset that used the first half of the original dataset to standardize the features (shifted all data using the mean of the first half, and scaled it using the standard deviation of the first half).

I then trained models using the first halves of the two datasets, and tested them on the second halves.

I used a linear regression.

Both models performed the same. They learned different parameters, but the performance measures were the same for both models.

Is there a specific type of model that this type of snooping affects? It did not appear to make a difference for linear regression.
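One possible explanation for the null result (my own reasoning, not from the lecture): least-squares linear regression with an intercept is invariant to how the features are affinely standardized, because both standardized design matrices span the same column space, so least squares recovers the same underlying linear map either way. A minimal numpy sketch checking that claim:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_tr, X_te = X[:100], X[100:]
y_tr, y_te = y[:100], y[100:]

def fit_and_predict(standardize_on):
    # Standardization parameters from the given inputs
    # (train-only, or snooped train+test).
    mu, sd = standardize_on.mean(axis=0), standardize_on.std(axis=0)
    A_tr = np.column_stack([np.ones(len(X_tr)), (X_tr - mu) / sd])
    A_te = np.column_stack([np.ones(len(X_te)), (X_te - mu) / sd])
    w, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return A_te @ w

pred_snooped = fit_and_predict(X)     # parameters use train+test inputs
pred_clean = fit_and_predict(X_tr)    # parameters use training inputs only

# Different fitted weights, identical test predictions:
print(np.allclose(pred_snooped, pred_clean))   # True
```

So a model whose predictions are unchanged by affine feature transformations would not show this form of snooping, which matches what I saw with linear regression.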

I attached the datasets to this post. There are 3 attachments. They are all csv files (but I used txt extensions since uploads didn't work with csv extensions). The first file, cpu-original, has the original data. The next file, cpu-standardized-from-all, has the data that has been standardized using all observations. The last file, cpu-standardized-from-train, has the data which has been standardized using only the parameters (mean and variance) from the first half of the data (i.e., the training data).

Any insight would be greatly appreciated!
Thanks,
Dan

 daniel0 11-03-2014 10:21 PM

Re: Data Snooping with Test Set Inputs Intuition

Note: I'm not familiar with the cpu dataset. I just tried it to see the effects of standardizing using train+test data, versus standardizing only using training data parameters. I also split training and testing sets in half, which seemed inconsequential for the purpose of this experiment.

 daniel0 11-03-2014 11:14 PM

Re: Data Snooping with Test Set Inputs Intuition

I just got home, so I was able to read through some of chapter 5 on data snooping. The exchange rate prediction example seems particularly vulnerable to this problem. I can't express it formally at the moment, but it seems like labels from the test set are making their way into the training set, since the input data consists of data that perfectly matches labels (that is, the label from observation i will be part of the input data of observation i+1, given the way the data set is constructed). I would be interested in the results if the same experiment were run with a much sparser dataset, such that any given rate change only shows up in one row of data.
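The overlap is easy to see in a sketch of how such a dataset is typically built (my illustration of a generic sliding-window scheme, not necessarily the book's exact setup):

```python
# Build (input, label) pairs from a series of rate changes
# with a sliding window of the k most recent changes.
rates = [0.5, 0.1, -0.2, 0.4, 0.3, -0.1, 0.2, 0.6]
k = 3  # number of past rate changes used as inputs

data = [(rates[t - k:t], rates[t]) for t in range(k, len(rates))]

# The label of example i reappears inside the inputs of example i+1,
# so any train/test split of these rows still shares raw rate values.
x0, y0 = data[0]
x1, y1 = data[1]
print(y0 in x1)   # True
```

With this construction, holding out rows does not hold out the underlying rate changes, which is why a "sparser" dataset (each rate change appearing in only one row) would be a cleaner test.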

So I suppose there may be cases where incorporating test input data (not labels, just the raw unsupervised inputs) is benign (like the example I gave in earlier posts), but it could have consequences in non-obvious ways.

Regarding dimensionality reduction, I've seen references to negative consequences and benign consequences. I have not run any experiments myself. It sounds like it could have non-obvious consequences (similar to the consequences of using test data for getting normalization parameters from Lecture 17).

Here's an example where someone references a problem:
"I detected only half of the generalization error rate when not redoing the PCA for every surrogate model"
http://stats.stackexchange.com/quest...ain-test-split

Here's an example where someone had no problem:
http://mikelove.wordpress.com/2012/0...and-test-data/

As before, any insight would be greatly appreciated, especially if any of these ideas have been formalized elsewhere.

Like I mentioned earlier, it's more obvious to me why validation scores may be inflated if labels from the test data were known (snooped) at the time of training. The following video provides an example and explanation: https://www.youtube.com/watch?v=S06JpVoNaA0

-Dan

 magdon 11-23-2014 06:37 AM

Re: Data Snooping with Test Set Inputs Intuition

Your questions are indeed subtle.

Indeed, it is very important to heed the warning at the bottom of page 9-5.

I highly recommend problem 9.10 as a concrete example of what can go wrong.

The problem that occurs can be illustrated with PCA, which does a form of dimensionality reduction. PCA identifies an `optimal' lower dimensional manifold on which the data sit. If you identify this manifold using test inputs, then you will (in some sense) be throwing away the least amount of the test inputs' information that you can, retaining only that part of each test input in the optimal lower dimension. Now, if you did the PCA using only the training data you will create your lower dimensional manifold to throw away the least amount of information in your training set. When you come to use this lower dimensional manifold on the test data (since it was not optimal for the test data), you will find that you may have thrown away important information in the test inputs which will hurt your test error.

The golden rule is that to make predictions on your test set, you can *only* use information from your training set. That is the way it is in practice, and that is the way you should evaluate yourself during the learning phase.

Here is a very simple way to check if you have data snooped. Before you do any learning, assume the data has been split into a training and test set for you. Run your entire learning process and output your final hypothesis g. Now, go and set all the data in your test set to strange values, like 0 for all the inputs and random target labels. Run your entire learning process again on this new pair of training set and perturbed test set and output your final hypothesis g'. If g ≠ g', then there has been data-snooping -- the test set is in some way influencing your choice of g.
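As a small sketch of this check (a toy pipeline of my own that deliberately snoops by standardizing with train+test inputs): zeroing out the test inputs changes the final hypothesis, which exposes the leak.

```python
import numpy as np

rng = np.random.default_rng(3)
X_tr, y_tr = rng.normal(size=(50, 3)), rng.normal(size=50)
X_te = rng.normal(size=(20, 3))

def learn(X_test):
    """A pipeline that (wrongly) standardizes with train+test inputs."""
    X_all = np.vstack([X_tr, X_test])
    mu, sd = X_all.mean(axis=0), X_all.std(axis=0)
    A = np.column_stack([np.ones(len(X_tr)), (X_tr - mu) / sd])
    w, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return w                       # the final hypothesis

g1 = learn(X_te)                   # learned with the real test inputs
g2 = learn(np.zeros_like(X_te))    # test set perturbed to strange values

# The two final hypotheses differ, so the test set influenced learning.
print(np.allclose(g1, g2))         # False -> data snooping detected
```

A pipeline that never touches the test set would return the same hypothesis in both runs.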

In learning from data, you must pay a price for any choices made using the data. Sometimes the price can be small or even zero, and sometimes it can be high. With snooping through input-preprocessing, the price is not easy to quantify, however, it is non-zero.

Quote:
 Originally Posted by daniel0 (Post 11809) I just got home, so I was able to read through some of chapter 5 on data snooping. [...] The following video provides an example and explanation: https://www.youtube.com/watch?v=S06JpVoNaA0 -Dan
