LFD Book Forum  

#1
06-09-2012, 04:01 PM
dbl001 (Member)
Data Snooping, Classifiers

Hi,

This is an excerpt from 'Mahout in Action' chapter 14 on building a classifier:

Preliminary analysis of data is critical to successful classification. It's sometimes fun because the analysis often turns up Easter eggs like the Moon-Phase header line in table 14.2. These surprises can also be important in building a classifier, because they can uncover problems in the data or give you a key insight that simplifies the classification problem. Visualize early and visualize often.

Sean Owen, Robin Anil, Ted Dunning, Ellen Friedman (2012-01-16 18:35:04.792000-06:00). Mahout in Action (Kindle Locations 6297-6300). Manning Publications. Kindle Edition.

Would this be considered 'Data Snooping'?

Thanks in Advance
#2
06-10-2012, 09:18 PM
Yellin (Member)
Re: Data Snooping, Classifiers

You probably want an answer from an expert, but writing as just another student, I'd say yes, it is snooping. It is nonetheless a good idea to do it, because the learning apparatus between one's ears is likely to be superior in some respects to any published learning model. The problem with snooping is not that it's reprehensible. It's just that it's wrong to pretend it hasn't taken place when estimating the out-of-sample error.
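
The effect on the out-of-sample estimate can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the data are pure noise, and the sizes `N_SMALL`, `N_FRESH`, `N_FEATURES` and the helper names are hypothetical, not taken from the thread or the book. It picks the feature that best matches the labels after looking at the whole data set, then compares the accuracy measured on that same data with the accuracy on fresh data.

```python
import random

rng = random.Random(0)
N_SMALL, N_FRESH, N_FEATURES = 50, 2000, 200  # hypothetical sizes

def sample(n):
    """Return n rows of (binary features, binary label); everything is pure noise."""
    return [([int(rng.random() < 0.5) for _ in range(N_FEATURES)],
             int(rng.random() < 0.5))
            for _ in range(n)]

def accuracy(rows, j):
    """Fraction of rows in which feature j equals the label."""
    return sum(x[j] == y for x, y in rows) / len(rows)

small = sample(N_SMALL)

# Snooping step: inspect every feature on the data and keep the best-looking one.
best = max(range(N_FEATURES), key=lambda j: accuracy(small, j))

snooped_acc = accuracy(small, best)          # estimated on the data we snooped
fresh_acc = accuracy(sample(N_FRESH), best)  # estimated on data never looked at

print(f"accuracy on snooped data: {snooped_acc:.2f}")
print(f"accuracy on fresh data:   {fresh_acc:.2f}")
```

Since the labels are random, no feature can truly beat 50%, so any gap between the two printed numbers comes from the selection step and not from learning anything real.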
#3
06-11-2012, 04:05 PM
htlin (NTU; Taipei, Taiwan)
Re: Data Snooping, Classifiers

Yes, it is snooping.

There is a fine line between data analysis (including visualization) and risky data snooping, though. In practical data mining applications (such as the KDD Cups that National Taiwan University has won in previous years), "careful data analysis" is important for reaching the best solution. For instance, in last year's KDD Cup, without analysis/snooping we could never have known that the music ratings generated by the Yahoo! system went through several phase changes because the system was upgraded.

As both a machine learning researcher and a data mining practitioner, my advice is:
(0) Never snoop the test set, because it is a point of no return.
(1) Be careful when analyzing/snooping the training set, and account for the complexity of every analysis/snooping step.
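
One way to follow (0) in practice is to set the test set aside before any analysis or visualization starts. The snippet below is a minimal sketch using only the Python standard library; the helper `holdout_split` and the toy `data` are hypothetical and only for illustration.

```python
import random
import statistics

def holdout_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical toy data: (feature, label) pairs.
data = [(float(i), i % 2) for i in range(100)]

# Split first, before looking at anything.
train, test = holdout_split(data)

# Exploration (summary statistics, plots, and so on) touches the training part only.
train_features = [x for x, _ in train]
train_mean = statistics.mean(train_features)
train_stdev = statistics.stdev(train_features)
print(f"train mean={train_mean:.2f}, stdev={train_stdev:.2f}")

# The test part is used once, at the very end, to estimate out-of-sample error.
```

Anything learned from exploring `train` still counts toward model complexity, as (1) says; the split only guarantees that the final estimate on `test` stays uncontaminated.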

Hope this helps.
__________________
When one teaches, two learn.
Tags
classifiers, data snooping
The contents of this forum are to be used ONLY by readers of the Learning From Data book by Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, and participants in the Learning From Data MOOC by Yaser S. Abu-Mostafa. No part of these contents is to be communicated or made accessible to ANY other person or entity.