LFD Book Forum data snooping
#1
10-22-2016, 10:43 AM
 Sangrock Lee Junior Member Join Date: Aug 2016 Posts: 6
data snooping

Is it data snooping even to look at the training data set? Assume the test data set is completely unknown, of course.
#2
11-03-2016, 10:08 AM
 CountVonCount Member Join Date: Oct 2016 Posts: 17
Re: data snooping

Quote:
 Originally Posted by Sangrock Lee Is it data snooping even to look at the training data set? Assume the test data set is completely unknown, of course.
That is an interesting question and I don't know the exact answer, but I'll try to give an answer that fits my understanding.

If you look at the training data set, you do some "learning" in your mind. Thus you dramatically decrease the number of hypotheses by choosing a hypothesis set that seems to fit the training data.
This means you cannot work with the d_VC of the reduced hypothesis set to calculate the generalization bound. Instead you need to use a higher d_VC, but it is unclear which one to use, since you don't know exactly the d_VC of the full hypothesis set in your mind before looking at the data.

However, if you have not looked at the test data and keep it safe until you find the final hypothesis g(x), you can verify your final hypothesis on the test data. The result is E_test, and with the Hoeffding bound you can estimate your E_out completely independently of the VC dimension.

Thus my answer is: yes, it is snooping if you look at the training data, so you cannot calculate the generalization bound from the VC dimension. But since you have not looked at the test data, you can instead calculate the Hoeffding bound, and the result is a valid estimate of the out-of-sample error.
However, keep in mind that after this calculation your test data is also compromised, and you cannot simply repeat the procedure if the result is not as expected.
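To make the last step concrete, here is a minimal sketch (in Python) of the Hoeffding error bar for a single hypothesis evaluated on a fresh test set; the function name and the delta = 0.05 confidence level are my own choices for illustration, not from the thread:

```python
import math

def hoeffding_bound(n_test, delta=0.05):
    """Hoeffding error bar for a single hypothesis evaluated on n_test
    untouched test points: |E_out - E_test| <= this value with
    probability at least 1 - delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n_test))

# Example: with 1000 untouched test points, E_out lies within
# roughly +/- 0.043 of E_test at 95% confidence.
margin = hoeffding_bound(1000)
```

Note that this bound is only valid for the first evaluation; once you have looked at E_test and reacted to it, the test set is compromised, exactly as described above.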
#3
12-17-2016, 05:24 PM
 Sangrock Lee Junior Member Join Date: Aug 2016 Posts: 6
Re: data snooping

Ah, thanks a ton! It sounds like there are two kinds of data snooping: i) looking at the training data and ii) looking at the test data. I guess looking at the training data happens commonly and inevitably if we are to use learning algorithms that require a training process, such as neural networks, PLA, support vector machines, and so on.
#4
12-26-2016, 11:29 PM
 hidir Junior Member Join Date: Dec 2016 Location: orlando Posts: 1
Re: data snooping

Quote:
 Originally Posted by CountVonCount That is an interesting question and I don't know the exact answer, but I'll try to give an answer that fits my understanding. If you look at the training data set, you do some "learning" in your mind. Thus you dramatically decrease the number of hypotheses by choosing a hypothesis set that seems to fit the training data. This means you cannot work with the d_VC of the reduced hypothesis set to calculate the generalization bound. Instead you need to use a higher d_VC, but it is unclear which one to use, since you don't know exactly the d_VC of the full hypothesis set in your mind before looking at the data. However, if you have not looked at the test data and keep it safe until you find the final hypothesis g(x), you can verify your final hypothesis on the test data. The result is E_test, and with the Hoeffding bound you can estimate your E_out completely independently of the VC dimension. Thus my answer is: yes, it is snooping if you look at the training data, so you cannot calculate the generalization bound from the VC dimension. But since you have not looked at the test data, you can instead calculate the Hoeffding bound, and the result is a valid estimate of the out-of-sample error. However, keep in mind that after this calculation your test data is also compromised, and you cannot simply repeat the procedure if the result is not as expected.
thanks

The contents of this forum are to be used ONLY by readers of the Learning From Data book by Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, and participants in the Learning From Data MOOC by Yaser S. Abu-Mostafa. No part of these contents is to be communicated or made accessible to ANY other person or entity.