06-01-2013, 09:01 AM
magdon
Join Date: Aug 2009
Location: Troy, NY, USA.
Posts: 596
Re: Under-represented class data

The VC bound still holds even if one class is much less frequent than the other. The VC bound does not depend on the test distribution.

You may wonder how this could possibly work in such an extremely unbalanced case. It works because the out-of-sample test will also contain only a small number of the under-represented class. Your concern is that you may not learn how to distinguish this poor class from the others because there are so few data points representing it. In the end it won't matter: that class will not appear often in the test sample, so even if you make more mistakes w.r.t. that class, those mistakes contribute little to the overall error. All you need is for the total N to be large enough and for the data to be sampled i.i.d. from the test distribution.
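To see this numerically, here is a minimal sketch (the class frequency and per-class error rates are made-up numbers, not from the post): even if the classifier errs 50% of the time on a class that occurs 1% of the time, the overall test error is dominated by performance on the common class.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: the rare class appears with probability p_rare in both
# training and test data (same distribution, i.i.d. sampling).
p_rare = 0.01       # minority-class frequency (hypothetical)
err_rare = 0.50     # error rate on the rare, poorly learned class
err_common = 0.05   # error rate on the common class

N_test = 100_000
is_rare = rng.random(N_test) < p_rare            # True = rare-class point
mistakes = np.where(is_rare,
                    rng.random(N_test) < err_rare,
                    rng.random(N_test) < err_common)

overall_error = mistakes.mean()
# Expected overall error = p_rare*err_rare + (1-p_rare)*err_common = 0.0545
expected = p_rare * err_rare + (1 - p_rare) * err_common
print(overall_error, expected)
```

Despite being wrong half the time on the rare class, the overall error stays near 5.5%, because the rare class rarely shows up in the test sample either.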

Note that a class discrepancy can arise for other reasons. In credit prediction, for example, we see few defaults because many of the would-be defaults were rejected credit card applications in the first place. This is a case where the in-sample data are not an i.i.d. sample from the test distribution. When this is the case, you have to be careful.

And yes, if you look into the data to determine the fraction of any class, that is snooping. Don't do it.

Originally Posted by:
Hello all,
This topic might have been discussed in some earlier posts, in which case I apologize for the repetition.
One of the characteristics I repeatedly see in many data sets is that some important class in a multi-class problem is under-represented. In other words, the data set doesn't have many instances of points belonging to a particular class.
From Prof. Mostafa's lectures, I understand that there is an underlying probability distribution P that acts on the input space X to draw the samples from it, but this distribution is only a mathematical requisite for the Hoeffding and VC framework. If I understand correctly, it is not our objective to find or learn about this distribution in a typical machine learning problem. And yet, this distribution can cause the kind of under-representation scenario I described above.
My question is: does this under-representation have the potential to influence generalization? I feel that it must, but am not sure how to quantify it. What are the ways to overcome this problem, given a data set one can't change? Should one just wait for the data set to get equitable class representation over time? In fact, is the act of checking class representation an act of snooping in the first place?
I would appreciate any pointers or references on this matter.
Thank you.
Have faith in probability