Re: Data Snooping with Test Set Inputs Intuition
I'm still trying to wrap my head around this. I tried the following experiment using the cpu dataset in Weka. From the original dataset, I created two additional datasets:
1) a dataset that used all data in the original dataset to standardize the features (zero mean and unit variance)
2) a dataset that used only the first half of the original dataset to standardize the features (all data was shifted by the mean of the first half and scaled using the variance of the first half).
I then trained a linear regression model on the first half of each of the two datasets, and tested each model on the second half of its dataset.
The two models learned different parameters, but their performance measures on the test half were the same.
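For anyone who wants to reproduce this outside Weka, here is a minimal sketch of the same experiment. It rests on several assumptions: it uses synthetic data rather than the cpu dataset, plain ordinary least squares with an intercept (via NumPy) rather than Weka's linear regression, and the helper names `standardize` and `fit_predict` are made up for illustration. In this setting the identical test performance is expected, because both standardizations are just shift-and-rescale transformations of the same features, and a least squares fit with an intercept absorbs those into its coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the cpu data (hypothetical values, not the real dataset).
n, d = 200, 6
X = rng.normal(loc=50.0, scale=20.0, size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(scale=5.0, size=n)
half = n // 2  # first half = training data, second half = test data


def standardize(X, ref):
    """Shift and scale X using the mean and standard deviation of ref."""
    return (X - ref.mean(axis=0)) / ref.std(axis=0)


def fit_predict(X_train, y_train, X_test):
    """Fit ordinary least squares with an intercept; return coefficients and test predictions."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return coef, A_test @ coef


X_from_all = standardize(X, X)            # dataset 1: statistics from all rows
X_from_train = standardize(X, X[:half])   # dataset 2: statistics from the first half only

coef_all, pred_all = fit_predict(X_from_all[:half], y[:half], X_from_all[half:])
coef_train, pred_train = fit_predict(X_from_train[:half], y[:half], X_from_train[half:])

rmse_all = np.sqrt(np.mean((pred_all - y[half:]) ** 2))
rmse_train = np.sqrt(np.mean((pred_train - y[half:]) ** 2))

print("coefficients equal:", np.allclose(coef_all, coef_train))
print("predictions equal: ", np.allclose(pred_all, pred_train))
print("test RMSE (from all, from train):", rmse_all, rmse_train)
```

The sketch only covers unregularized linear regression. It says nothing about models whose fit depends on the feature scale, such as regularized or distance-based ones.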
Is there a specific type of model that this type of snooping affects? It did not appear to make a difference for linear regression.
I attached the datasets to this post. There are 3 attachments, all csv files (I used txt extensions because uploads didn't work with csv extensions). The first file, cpuoriginal, has the original data. The next file, cpustandardizedfromall, has the data standardized using all observations. The last file, cpustandardizedfromtrain, has the data standardized using only the parameters (mean and variance) from the first half of the data (i.e., the training data).
Any insight would be greatly appreciated!
Thanks,
Dan
