06-11-2012, 03:05 PM
htlin
Join Date: Aug 2009
Location: Taipei, Taiwan
Posts: 601
Re: Data Snooping, Classifiers

Originally Posted by dbl001

This is an excerpt from 'Mahout in Action' chapter 14 on building a classifier:

Preliminary analysis of data is critical to successful classification. It’s sometimes fun because the analysis often turns up Easter eggs like the Moon-Phase header line in table 14.2. These surprises can also be important in building a classifier, because they can uncover problems in the data or give you a key insight that simplifies the classification problem. Visualize early and visualize often.

Sean Owen, Robin Anil, Ted Dunning, Ellen Friedman (2012-01-16 18:35:04.792000-06:00). Mahout in Action (Kindle Locations 6297-6300). Manning Publications. Kindle Edition.

Would this be considered 'Data Snooping'?

Thanks in Advance

Yes, it is snooping.

There is a fine line between data analysis (including visualization) and risky data snooping, though. In practical data mining applications (such as the KDD Cups that National Taiwan University has won in previous years), careful data analysis is important for reaching the best solution. For instance, in last year's KDD Cup, without analysis/snooping we could never have known that the music ratings generated by the Yahoo! system went through several phase changes because the system was upgraded.

As both a machine learning researcher and a data mining practitioner, my advice is:
(0) Never snoop the test set, because it is a point of no return.
(1) Be careful when analyzing/snooping the training set, and account for the complexity of every analysis/snooping step.
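
To make the two points above concrete, here is a minimal Python sketch. It assumes a toy one-feature dataset, and the helper names (split_before_looking, fit_scaler, apply_scaler, snooping_log) are hypothetical, made up for illustration. The idea is to set the test set aside before looking at anything, derive every data-dependent choice from the training set only, and keep a record of each such choice so that its complexity can be accounted for later.

```python
import random
import statistics


def split_before_looking(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and set the test part aside before any analysis."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # Returns (train, test); the test part should stay untouched until the end.
    return shuffled[n_test:], shuffled[:n_test]


def fit_scaler(train_values):
    """Compute normalization statistics from the training values only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values) or 1.0  # guard against zero spread
    return mean, std


def apply_scaler(values, mean, std):
    """Apply statistics that were fitted elsewhere (on the training set)."""
    return [(v - mean) / std for v in values]


# Toy data, purely illustrative.
data = [float(x) for x in range(100)]

# (0) Lock the test set away first; nothing below inspects it.
train, test = split_before_looking(data)

# (1) Every choice that depends on the data uses the training set only,
#     and each such choice is written down.
snooping_log = []
mean, std = fit_scaler(train)
snooping_log.append("normalized the feature with training mean/std")

train_scaled = apply_scaler(train, mean, std)
# The test set is transformed with the training statistics,
# and only at final evaluation time.
test_scaled = apply_scaler(test, mean, std)
```

The log is only a bookkeeping device: it does not compute a bound by itself, but it keeps the number of data-driven decisions visible when judging how much to trust the final result.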

Hope this helps.
When one teaches, two learn.