LFD Book Forum  

LFD Book Forum > General > General Discussion of Machine Learning
  #1  
Old 05-28-2013, 08:32 AM
kartikeya_t@yahoo.com
Member

Join Date: Jul 2012
Posts: 17
Under-represented class data

Hello all,
This topic might have been discussed in some earlier posts, in which case I apologize for the repetition.
One characteristic I repeatedly see in many data sets is that some important class in a multi-class problem is under-represented. In other words, the data set contains very few instances of points belonging to a particular class.
From Prof. Abu-Mostafa's lectures, I understand that there is an underlying probability distribution P on the input space X from which the samples are drawn, but this distribution is only a mathematical requisite for the Hoeffding and VC framework. If I understand correctly, it is not our objective to find or learn this distribution in a typical machine learning problem. And yet, this distribution can cause the kind of under-representation scenario I described above.
My question is: does this under-representation have the potential to influence generalization? I feel that it must, but am not sure how to quantify it. What are the ways to overcome this problem, given a data set one can't change? Should one just wait for the data set to reach equitable class representation over time? In fact, is the act of checking class representation an act of snooping in the first place?
I would appreciate any pointers or references on this matter.
Thank you.
  #2  
Old 05-28-2013, 01:15 PM
Elroch
Invited Guest

Join Date: Mar 2013
Posts: 143
Re: Under-represented class data

From the way you put it, I think you know the answer is "yes".

Imagine if few or none of your samples of a function occur in a particular part of its domain: how much can you say about that part of the domain? If your sample of handwritten characters contains hardly any "q"s, how well do you think your machine learning program will be able to recognise other examples of this letter (assuming written "q"s vary quite a lot)?

Where it would not matter too much is when your data is unbalanced but in some sense adequate even where it is sparser. Suppose you have 1,000 examples of "q" and 10,000 of most other letters: maybe it won't do much harm, because your sample may include most of the key information to be found in a much larger sample. There is a diminishing return to having more examples once you pass the level that is adequate for good generalisation.
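That diminishing return can be illustrated with a quick simulation. The Gaussian estimation task, sample sizes, and trial counts below are illustrative assumptions, not anything from this thread:

```python
import random
import statistics

random.seed(0)

def estimation_error(n_samples, true_mean=1.0, trials=100):
    """Average absolute error when estimating a class mean from n samples."""
    errors = []
    for _ in range(trials):
        sample = [random.gauss(true_mean, 1.0) for _ in range(n_samples)]
        errors.append(abs(statistics.mean(sample) - true_mean))
    return statistics.mean(errors)

# The error shrinks roughly like 1/sqrt(n): going from 10 to 1,000 examples
# helps a great deal; going from 1,000 to 10,000 helps far less.
for n in (10, 100, 1000, 10000):
    print(n, round(estimation_error(n), 4))
```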
  #3  
Old 05-30-2013, 05:09 AM
kartikeya_t@yahoo.com
Member

Join Date: Jul 2012
Posts: 17
Re: Under-represented class data

Thank you Elroch. You are right, I do sense that there is a pitfall in sparse data sets.
The data set I have is already quite small (about 200 points), with about 95% representation of one class and 5% of the other. This is data we are slowly gathering from the field (power grid instability - most of the time, things are running well!), and I expect that in time we will have more data, but the imbalance will always remain. Your point about a sparse class still being adequately represented within a large data set is comforting, but I have not reached that situation yet.
Do you happen to know of any techniques that try to deal with this sparseness problem in smaller sets?
Thanks.
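For what it's worth, one workaround I have seen mentioned elsewhere (not in this course, so I don't know how principled it is) is resampling: bootstrap the minority class so the learner sees it more often. A rough sketch of the idea, with all data made up to mimic the 190/10 split above:

```python
import random

random.seed(1)

# Hypothetical 2-d data mimicking the ~95%/5% split: 190 vs 10 points.
majority = [([random.gauss(0, 1), random.gauss(0, 1)], 0) for _ in range(190)]
minority = [([random.gauss(3, 1), random.gauss(3, 1)], 1) for _ in range(10)]

def oversample(data, target_size):
    """Bootstrap-sample a class up to target_size points (with replacement)."""
    return [random.choice(data) for _ in range(target_size)]

balanced = majority + oversample(minority, len(majority))
random.shuffle(balanced)

counts = {0: 0, 1: 0}
for _, label in balanced:
    counts[label] += 1
print(counts)  # both classes now equally represented
```

Of course, oversampling adds no new information about the rare class; it mainly rebalances the error function rather than improving generalisation as such.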
  #4  
Old 05-30-2013, 06:13 AM
Elroch
Invited Guest

Join Date: Mar 2013
Posts: 143
Re: Under-represented class data

Given your numbers, the way I would be thinking of it is this. You have a classification problem in some space and only about 10 points in one of the two categories (if I understand correctly). If there is much noise in your data, you can think of this number as being reduced even further (that is my intuitive view, anyhow). The conclusion would be that simpler models are going to be better.

You don't need to quantify this yourself. If you do leave-one-out cross-validation with a suitable SVM model, for example, you will automatically find a model with an appropriate complexity and cost function for your data. I think this might work even with such a tiny sample of the category, although clearly there will be more risk of it being misled by chance.
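Leave-one-out cross-validation itself is easy to sketch. Here a simple nearest-neighbour rule stands in for the SVM, on made-up data with about 10 rare-class points (`math.dist` needs Python 3.8+):

```python
import math
import random

random.seed(2)

# Tiny synthetic two-class set: 50 common points, 10 rare ones.
data = ([([random.gauss(0, 1), random.gauss(0, 1)], 0) for _ in range(50)]
        + [([random.gauss(2.5, 1), random.gauss(2.5, 1)], 1) for _ in range(10)])

def nearest_neighbour(train, x):
    """Predict the label of the closest training point."""
    return min(train, key=lambda p: math.dist(p[0], x))[1]

def loocv_error(data, classifier):
    """Leave each point out once, train on the rest, count mistakes."""
    mistakes = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]
        if classifier(train, x) != y:
            mistakes += 1
    return mistakes / len(data)

print("LOOCV error:", loocv_error(data, nearest_neighbour))
```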

One issue that may be important is the cost of errors of the two types: as the course emphasised, an asymmetry in the importance of errors affects your cost function, and this affects the optimal hypothesis.
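That asymmetry can be written straight into the error measure. The specific costs below (a missed positive costing ten times a false alarm) are purely illustrative:

```python
def weighted_error(predictions, labels, cost_fn=10.0, cost_fp=1.0):
    """Average cost, charging missed positives more than false alarms.

    cost_fn: cost of predicting 0 when the truth is 1 (false negative)
    cost_fp: cost of predicting 1 when the truth is 0 (false positive)
    """
    total = 0.0
    for pred, truth in zip(predictions, labels):
        if pred == 0 and truth == 1:
            total += cost_fn
        elif pred == 1 and truth == 0:
            total += cost_fp
    return total / len(labels)

# "Always predict the majority class" looks fine on plain 0-1 error
# but poor once the rare class's mistakes are weighted up.
labels = [0] * 19 + [1]
print(weighted_error([0] * 20, labels))  # one false negative: 10/20 = 0.5
```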

I suspect that if the data is noisy, points in category A may dominate points in category B to such an extent that avoiding the "always A" hypothesis, or something close to it, will be an issue (because points in B may be like raisins in one side of a cake made of A). I think that was also in the course. You can deal with this through the error function, I believe. You can also think of it in terms of ending up with a Bayesian probability of one category or the other, given your data.

One idea I don't think we looked at (so maybe it's a bad one) is to combine logistic regression with non-linear transforms (if your data merits them) to predict these probabilities. If this method does make sense, perhaps it's possible to use leave-one-out cross validation, like for an SVM. Or perhaps not: would someone more knowledgeable on this particular topic please clarify?
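A bare-bones version of that idea, where the quadratic transform, the toy 95/5 data, and the learning rate are all illustrative assumptions:

```python
import math
import random

random.seed(3)

def transform(x1, x2):
    """Second-order polynomial transform of a 2-d input."""
    return [1.0, x1, x2, x1 * x1, x2 * x2, x1 * x2]

def sigmoid(s):
    s = max(-30.0, min(30.0, s))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-s))

def predict(w, z):
    return sigmoid(sum(wi * zi for wi, zi in zip(w, z)))

def fit_logistic(points, labels, lr=0.05, epochs=1000):
    """Batch gradient descent on the cross-entropy error."""
    w = [0.0] * len(points[0])
    n = len(points)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for z, y in zip(points, labels):
            p = predict(w, z)
            for j, zj in enumerate(z):
                grad[j] += (p - y) * zj / n
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w

# Hypothetical 95/5 data: the rare class clusters around (3, 3).
xs = ([(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(95)]
      + [(random.gauss(3, 1), random.gauss(3, 1)) for _ in range(5)])
ys = [0] * 95 + [1] * 5

zs = [transform(*x) for x in xs]
w = fit_logistic(zs, ys)

# The output can be read as the probability of the rare class at a point.
print("P(rare) near (3,3):", round(predict(w, transform(3, 3)), 3))
print("P(rare) near (0,0):", round(predict(w, transform(0, 0)), 3))
```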
  #5  
Old 05-31-2013, 06:13 AM
Elroch
Invited Guest

Join Date: Mar 2013
Posts: 143
Re: Under-represented class data

Thinking about this later, it occurred to me that I was being a bit like the penguin in one of Yaser's slides, cheerfully going forth into a minefield.

I don't know the nature of your data set or the precise objective, and my idea about looking for a model is just an idea. [For one thing, Prof. Lin points out that there is some evidence that leave-one-out cross-validation is sometimes unstable, although I still don't understand why this is.] So take my ideas with a pinch of salt!
  #6  
Old 06-01-2013, 09:01 AM
magdon
RPI

Join Date: Aug 2009
Location: Troy, NY, USA
Posts: 592
Re: Under-represented class data

The VC bound still holds even if one class is much less frequent than the other. The VC bound does not depend on the test distribution.

You may wonder how this could possibly work in such an extremely unbalanced case. It is taken into account by the fact that the out-of-sample test will also contain just a small number of the under-represented class. Your concern is that you may not learn how to distinguish this poor class from the others because there is so little data representing it. In the end it won't matter much: since that class will not appear often in the test sample, even if you make more mistakes w.r.t. that class, it won't matter much. All you need is for the total N to be large enough and the data to be sampled i.i.d. from the test distribution.
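This can be checked numerically: give a test set the same 95/5 mix and score a hypothesis that gets every rare point wrong. A toy sketch, where only the 95/5 split comes from this thread:

```python
import random

random.seed(4)

# i.i.d. test sample from a distribution where class 1 appears 5% of the time.
test_labels = [1 if random.random() < 0.05 else 0 for _ in range(100000)]

# A hypothesis that gets every majority point right and every rare point wrong.
predictions = [0] * len(test_labels)

errors = sum(p != y for p, y in zip(predictions, test_labels))
print(errors / len(test_labels))  # close to 0.05: overall test error stays small
```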

Note that you can have a class discrepancy for other reasons. In credit prediction, for example, we have few defaults because many of the would-be defaults were rejected credit card applications in the first place. This is a case where the in-sample data is not sampled i.i.d. from the test distribution. When this is the case, you have to be careful.

And yes, even looking into the data to determine the fraction of any class is snooping. Don't do it.


__________________
Have faith in probability
  #7  
Old 06-01-2013, 04:07 PM
Elroch
Invited Guest

Join Date: Mar 2013
Posts: 143
Re: Under-represented class data

kartikeya, you might be interested in a lecture segment from a different course on machine learning which relates to a key part of your question.

Error metrics for skewed classes

which deals with the interesting error-related quantities "precision" and "recall" in such scenarios.

Alternatively, if you can decide what the cost of an error of each type would be, you can simply modify your error function to weight errors of one type more heavily than errors of the other when training. If the costs chosen are accurate, I can't see how you could do better.
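For concreteness, precision and recall are easy to compute from raw counts; the 95/5 labels and predictions below are hypothetical:

```python
def precision_recall(predictions, labels, positive=1):
    """Precision: of the points flagged positive, how many really are.
    Recall: of the truly positive points, how many were flagged."""
    tp = sum(1 for p, y in zip(predictions, labels) if p == positive and y == positive)
    fp = sum(1 for p, y in zip(predictions, labels) if p == positive and y != positive)
    fn = sum(1 for p, y in zip(predictions, labels) if p != positive and y == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# "Always predict the majority class" has zero recall on the rare class,
# which a plain classification error of 5% completely hides.
labels = [0] * 95 + [1] * 5
print(precision_recall([0] * 100, labels))  # (0.0, 0.0)
```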
The contents of this forum are to be used ONLY by readers of the Learning From Data book by Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, and participants in the Learning From Data MOOC by Yaser S. Abu-Mostafa. No part of these contents is to be communicated or made accessible to ANY other person or entity.