05-30-2013, 06:13 AM
Elroch
Invited Guest
Join Date: Mar 2013
Posts: 143
Re: Under-represented class data

Given your numbers, this is how I would think about it. You have a classification problem in some space and only about 10 points in one of the two categories (if I understand correctly). If there is much noise in your data, you can think of this number as being effectively reduced even further (that is my intuitive view, anyhow). The conclusion would be that simpler models are going to be better.

You don't need to quantify this yourself. If you use leave-one-out cross-validation to choose among suitable SVM models, for example, the validation error will steer you towards a model with an appropriate complexity and cost function for your data. I think this might work even with such a tiny sample of the smaller category, although clearly there is more risk of the selection being misled by chance.
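To make the leave-one-out idea concrete, here is a minimal sketch. To keep it free of libraries it uses k-nearest-neighbours as a stand-in for the SVM, with k playing the role of the complexity setting; with an SVM the loop would be the same, with the kernel and C in place of k. The toy data (90 points in A, 10 in B) and all the function names are my own assumptions, not your data.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    nearest = sorted(train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def loo_error(data, k):
    """Leave-one-out error: hold out each point once and train on the rest."""
    mistakes = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        mistakes += knn_predict(rest, x, k) != y
    return mistakes / len(data)

def make_toy_data(n_a=90, n_b=10, seed=0):
    """Hypothetical data: category A (label 0) around the origin, B (label 1) offset."""
    rng = random.Random(seed)
    a = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(n_a)]
    b = [((rng.gauss(2.5, 1), rng.gauss(2.5, 1)), 1) for _ in range(n_b)]
    return a + b

data = make_toy_data()
# Small k = more complex decision boundary, large k = simpler one.
errors = {k: loo_error(data, k) for k in (1, 3, 5, 9, 15)}
best_k = min(errors, key=errors.get)
print(errors, best_k)
```

With only about 10 points in B, the differences between the candidate settings can easily come down to one or two held-out points, which is the risk of being misled by chance mentioned above.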

One issue that may be important is the cost of errors of the two types: as the course emphasised, an asymmetry in the importance of errors affects your cost function, and this affects the optimal hypothesis.
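As a small worked example of how asymmetric costs move the decision, suppose you already have an estimate of P(B | x). Predicting B has lower expected cost than predicting A when p * cost_missed_b > (1 - p) * cost_false_b, which gives the threshold below. The cost values and function names are purely illustrative assumptions.

```python
def decision_threshold(cost_false_b, cost_missed_b):
    """Probability of B above which predicting B has the lower expected cost.

    cost_false_b:  cost of predicting B when the truth is A
    cost_missed_b: cost of predicting A when the truth is B
    """
    return cost_false_b / (cost_false_b + cost_missed_b)

def decide(p_b, cost_false_b, cost_missed_b):
    """Pick the category with the lower expected cost, given P(B | x) = p_b."""
    return "B" if p_b > decision_threshold(cost_false_b, cost_missed_b) else "A"

# Symmetric costs give the usual 0.5 cut-off.
print(decision_threshold(1, 1))
# If missing a B is 9 times as costly, a 20% chance of B is already enough.
print(decide(0.2, 1, 9))
```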

I suspect that if the data is noisy, points in category A may dominate points in category B to such an extent that avoiding the "always A" hypothesis, or something close to it, will be an issue (because points in B may be like raisins in one side of a cake made of A). I think that was also in the course. I believe you can deal with this through the error function. You can also think of it in terms of ending up with a Bayesian probability of one category or the other, given your data.
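A quick bit of arithmetic shows why "always A" is tempting under a plain error count and how a different error function changes that. The 90/10 split is an assumed example (only the "about 10" comes from your description), and the class-balanced error used here is just one possible choice of error function.

```python
def error_rates(labels, predictions):
    """Return (overall error, class-balanced error) for labels "A" and "B"."""
    wrong = {"A": 0, "B": 0}
    count = {"A": 0, "B": 0}
    for y, p in zip(labels, predictions):
        count[y] += 1
        wrong[y] += p != y
    overall = (wrong["A"] + wrong["B"]) / len(labels)
    # Average the per-class error rates so each category counts equally.
    balanced = 0.5 * (wrong["A"] / count["A"] + wrong["B"] / count["B"])
    return overall, balanced

labels = ["A"] * 90 + ["B"] * 10      # assumed class sizes
always_a = ["A"] * len(labels)        # the "always A" hypothesis
overall, balanced = error_rates(labels, always_a)
print(overall, balanced)
```

Under the plain count "always A" is wrong on only a tenth of the points, while under the balanced error it is no better than a coin flip.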

One idea I don't think we looked at (so maybe it's a bad one) is to combine logistic regression with non-linear transforms (if your data merits them) to predict these probabilities. If this method does make sense, perhaps it's possible to use leave-one-out cross-validation, as for an SVM. Or perhaps not: would someone more knowledgeable on this particular topic please clarify?
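For what it's worth, here is a rough sketch of what that combination might look like mechanically: a second-order transform, logistic regression fitted by gradient descent, and leave-one-out scoring of the predicted probabilities by log loss. Nothing in the leave-one-out loop itself is specific to SVMs. The toy data, the learning rate, the small L2 penalty and every function name are my own assumptions, and whether this is a good idea statistically with so few B points is still the open question.

```python
import math
import random

def transform(x):
    """Second-order polynomial transform of a 2-D point (bias term included)."""
    x1, x2 = x
    return [1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2]

def sigmoid(s):
    # Clamp the score to avoid overflow in math.exp.
    s = max(-30.0, min(30.0, s))
    return 1.0 / (1.0 + math.exp(-s))

def fit_logistic(points, labels, steps=200, rate=0.5, l2=0.01):
    """Batch gradient descent on the L2-regularised cross-entropy error."""
    feats = [transform(x) for x in points]
    w = [0.0] * len(feats[0])
    n = len(feats)
    for _ in range(steps):
        grad = [l2 * wj for wj in w]
        for z, y in zip(feats, labels):
            err = sigmoid(sum(wj * zj for wj, zj in zip(w, z))) - y
            for j, zj in enumerate(z):
                grad[j] += err * zj / n
        w = [wj - rate * gj for wj, gj in zip(w, grad)]
    return w

def predict_proba(w, x):
    """Estimated probability that x belongs to category B (label 1)."""
    return sigmoid(sum(wj * zj for wj, zj in zip(w, transform(x))))

def loo_log_loss(points, labels):
    """Leave-one-out estimate of the log loss of the predicted probabilities."""
    total = 0.0
    for i in range(len(points)):
        w = fit_logistic(points[:i] + points[i + 1:], labels[:i] + labels[i + 1:])
        p = min(max(predict_proba(w, points[i]), 1e-12), 1 - 1e-12)
        total -= math.log(p if labels[i] == 1 else 1 - p)
    return total / len(points)

# Hypothetical data: A spread over the square, B clustered in one corner.
rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(24)]
labels = [0] * 24
points += [(rng.gauss(0.6, 0.15), rng.gauss(0.6, 0.15)) for _ in range(6)]
labels += [1] * 6

loss = loo_log_loss(points, labels)
print(loss)
```

The same loop could be rerun with a different transform or penalty, and the leave-one-out losses compared, in the same way one would compare SVM settings.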