LFD Book Forum  

11-03-2014, 08:12 PM
daniel0
Junior Member
Join Date: Nov 2014
Posts: 5
Data Snooping with Test Set Inputs Intuition

Lecture 17 gives an example in which test data is used to calculate the means for pre-processing (normalizing) the training data. The lecture indicates that doing so biases the results: the model's measured performance on the test set comes out inflated.

It makes sense to me that test data should not be used at all for learning the parameters of a model, including the parameters used for pre-processing. After all, when a model is used in production, the pre-processing parameters must already exist; they cannot be a function of the data arriving online.
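
To make that constraint concrete, here is a minimal sketch of what I mean by pre-processing parameters that "already exist". The helper names `fit_scaler` and `apply_scaler` and the toy numbers are my own assumptions for illustration; they are not from the lecture or the book. The means and standard deviations are learned from the training inputs only, then frozen and reused on anything that arrives later.

```python
from statistics import mean, pstdev

def fit_scaler(train_rows):
    """Learn per-feature mean and standard deviation from training inputs only."""
    columns = list(zip(*train_rows))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 when a feature has zero spread, so division is defined.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds

def apply_scaler(rows, means, stds):
    """Normalize any rows (train, test, or production) with frozen parameters."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical toy data: two features, three training points, one test point.
train_x = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
test_x = [[4.0, 40.0]]

means, stds = fit_scaler(train_x)             # the test inputs play no part here
train_z = apply_scaler(train_x, means, stds)
test_z = apply_scaler(test_x, means, stds)    # same frozen parameters
```

The design point is only that `fit_scaler` never sees `test_x`, which mirrors the production setting where online data cannot influence parameters that were fixed beforehand.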

However, I am having a hard time with the intuition behind the Lecture 17 example. Why does using test data to calculate the normalization means improve the measured performance when the model is tested? It is clearer to me why the test scores would be inflated if, say, the test labels were somehow incorporated into the training process (for example, by doing feature selection before splitting the data).
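
As a small sketch of the mechanics I am asking about (toy numbers of my own, not the lecture's data): when the mean is computed over the pooled inputs, the normalized training inputs themselves change, so whatever is learned from them becomes a function of the test inputs even though no test labels are involved. This only shows the dependence; it does not by itself demonstrate how large the resulting bias is.

```python
from statistics import mean

# Hypothetical one-feature example.
train_x = [1.0, 2.0, 3.0]
test_x = [10.0, 12.0]

clean_mean = mean(train_x)               # training inputs only
snooped_mean = mean(train_x + test_x)    # pooled: test inputs leak in

# The same training points end up at different coordinates depending on
# whether the test inputs were used to compute the centering constant.
clean_train = [x - clean_mean for x in train_x]
snooped_train = [x - snooped_mean for x in train_x]

print("mean without test inputs:", clean_mean)
print("mean with test inputs:   ", snooped_mean)
```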

The contents of this forum are to be used ONLY by readers of the Learning From Data book by Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, and participants in the Learning From Data MOOC by Yaser S. Abu-Mostafa. No part of these contents is to be communicated or made accessible to ANY other person or entity.