LFD Book Forum (http://book.caltech.edu/bookforum/index.php)
-   Homework 7 (http://book.caltech.edu/bookforum/forumdisplay.php?f=136)
-   -   Using the whole Data lec13 (http://book.caltech.edu/bookforum/showthread.php?t=1073)

 Andrs 08-21-2012 07:42 AM

Using the whole Data lec13

In the lecture 13 (Validation),
In Lecture 13 (Validation), the 10-fold cross validation mechanism with the training data D is used to select the best "learning model". My question is whether there is any point in re-running the selected hypothesis (best hypothesis in the selected model) using the whole training data set (D) in order to get a better estimate of Eout, or is Ecv (the cross-validation error) a good enough estimate of Eout?

 yaser 08-21-2012 12:06 PM

Re: Using the whole Data lec13

Quote:
 Originally Posted by Andrs (Post 4205) In the lecture 13 (Validation), The 10-Fold cross validation mechanism with the training data D is used to select the best "learning model" . My question is if there is any point in running the selected hypothesis (best_Hypotheses in the selected model) using the whole training data set (D) in order to get a better estimate of Eout . Or is the Ecv (cross validation Error) a good enough estimate of Eout.
It is a good idea to restore the full data set and use it for training once the model has been selected, but the problem with using the full data set to estimate Eout for any hypothesis in this process is that part of the data set would already have been used for training to come up with this hypothesis, so that part will have a built-in bias. The cross-validation data points, although they are fewer, do not have that bias, hence their estimate of Eout is more reliable.
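[Editor's note: the procedure discussed here can be sketched in code. This is an illustration only, not from the lecture; the `train_fn` interface, the squared-error measure, and the polynomial models are all assumptions made for the example.]

```python
import numpy as np

def k_fold_cv_error(X, y, train_fn, k=10, seed=0):
    """Estimate Eout of a learning model by k-fold cross validation.

    train_fn(X_tr, y_tr) -> hypothesis h, where h(X) -> predictions.
    Returns Ecv, the average validation (squared) error over the k folds.
    """
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]                                   # held-out fold
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        h = train_fn(X[tr], y[tr])                       # train on the rest
        errors.append(np.mean((h(X[val]) - y[val]) ** 2))
    return float(np.mean(errors))

# Example: compare polynomial models of degree 1 and 3 on noisy cubic data.
rng = np.random.default_rng(1)
X = np.linspace(-1, 1, 60)
y = X**3 - 0.5 * X + 0.05 * rng.standard_normal(60)

def make_poly_model(degree):
    def train_fn(X_tr, y_tr):
        coeffs = np.polyfit(X_tr, y_tr, degree)
        return lambda X_new: np.polyval(coeffs, X_new)
    return train_fn

ecv = {d: k_fold_cv_error(X, y, make_poly_model(d)) for d in (1, 3)}
best_degree = min(ecv, key=ecv.get)   # model selected by lowest Ecv
```

Once `best_degree` is chosen this way, the point of the reply above is that the final hypothesis should then be trained on all of D, while Ecv (computed only on held-out folds) remains the unbiased estimate of its Eout.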

 Andrs 08-21-2012 12:25 PM

Re: Using the whole Data lec13

I would like to check that I really understood your recommendation. I will be consuming all my training data in the cross-validation procedure. Through CV I select the model and the hypothesis (g-) with the corresponding parameters, and I get Ecv, which is a good estimate of Eout.
Your suggestion is that I could take this model (hypothesis set) and (re)train it on the full training data in order to select a new hypothesis (g+). This new hypothesis (g+) may do better than the hypothesis (g-), but the only safe estimate of Eout is the one I got through cross validation (Ecv). The only "problem" here is that I now have no data left to "test" this new hypothesis (g+).

 yaser 08-21-2012 12:52 PM

Re: Using the whole Data lec13

Quote:
 Originally Posted by Andrs (Post 4223) Thanks for the quick answer. I would like to check that I really understood your recomendation: I will be consuming all my trainning-data with the cross validation procedure. Through the CV I select the model and the hypothesis (g-) with the corresponding parameters and I get Ecv that is a good estimate of Eout. Your suggestion is that I could use this model (hypothesis set) and (re)train it on the full trainning-data in order to select a new hypothesis(g+). This new hypothesis(g+) may do better than the hypothesis (g-) but the only safer estimate for Eout is the estimate that I got thru the cross validation(Ecv). The only "problem" here is that now I do not have any data to "test" this new hypothesis (g+).
The hypothesis trained on the full data set, denoted by g (which you refer to as g+), is indeed the result of this process. To estimate its Eout, we still use the cross-validation estimate Ecv for Eout(g-), notwithstanding the fact that g is a different hypothesis (but close enough), for the reason you outline: we have no cross-validation data points left to evaluate Eout(g) directly.
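[Editor's note: a minimal self-contained sketch of this final step. The data, the linear model, and the 10-fold split are assumptions made for illustration; only the two-stage procedure itself comes from the thread.]

```python
import numpy as np

# Toy data set D: 50 noisy points from a linear target (assumed setup).
rng = np.random.default_rng(2)
X = np.linspace(-1, 1, 50)
y = 2 * X + 0.1 * rng.standard_normal(50)

# Stage 1: 10-fold cross validation for the selected (linear) model.
# Each fold's hypothesis is a g- : trained on D minus one fold.
k = 10
folds = np.array_split(rng.permutation(50), k)
fold_errors = []
for i in range(k):
    val = folds[i]
    tr = np.concatenate([folds[j] for j in range(k) if j != i])
    coeffs = np.polyfit(X[tr], y[tr], 1)
    fold_errors.append(np.mean((np.polyval(coeffs, X[val]) - y[val]) ** 2))
e_cv = float(np.mean(fold_errors))

# Stage 2: restore the full data set and train once more to get g.
g = np.polyfit(X, y, 1)   # final hypothesis, trained on all of D

# Report e_cv as the (slightly pessimistic) estimate of Eout(g);
# no fresh data points remain to evaluate g directly.
```

Since each g- was trained on 90% of D, e_cv tends to slightly overestimate Eout(g), which is why it is a safe (pessimistic) estimate rather than an optimistic one.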

 rainbow 08-21-2012 12:58 PM

Re: Using the whole Data lec13

I think you summarized the idea very well. I guess the idea behind CV is to estimate E_out (by E_cv) in situations where you are short on data to start with. Then you can't afford to lose data points when you re-train the selected model on the full data set in order to get the final hypothesis g.
