  #2  
05-28-2013, 02:15 PM
Elroch
Invited Guest
 
Join Date: Mar 2013
Posts: 143
Re: Under-represented class data

From the way you put it, I think you know the answer is "yes".

Imagine that few or none of your samples of a function fall in a particular part of its domain: how much can you say about that part of the domain? If your sample of handwritten characters contains hardly any "q"s, how well do you think your machine learning program will recognise other examples of this letter (assuming written "q"s vary quite a lot)?
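
A minimal sketch of this point, assuming a toy target sin(2*pi*x), a 1-nearest-neighbour predictor, and training samples drawn only from the left half of the domain. All of these are illustrative choices, not anything from the original question:

```python
import numpy as np

def target(x):
    # Hypothetical "true" function, chosen only for illustration.
    return np.sin(2 * np.pi * x)

def nearest_neighbor_predict(x_train, y_train, x_query):
    # For each query point, return the label of the closest training point.
    idx = np.abs(x_query[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[idx]

rng = np.random.default_rng(0)

# Training samples cover only [0, 0.5]; none fall in (0.5, 1].
x_train = rng.uniform(0.0, 0.5, size=50)
y_train = target(x_train)

x_covered = np.linspace(0.0, 0.5, 200)    # region the samples cover
x_uncovered = np.linspace(0.5, 1.0, 200)  # region with no samples

def mse(x_query):
    pred = nearest_neighbor_predict(x_train, y_train, x_query)
    return float(np.mean((pred - target(x_query)) ** 2))

mse_covered = mse(x_covered)
mse_uncovered = mse(x_uncovered)

print(f"MSE where samples exist:    {mse_covered:.4f}")
print(f"MSE where there are none:   {mse_uncovered:.4f}")
```

The predictor has nothing to go on in the unsampled half, so its error there should come out far larger than in the half it has seen.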

It matters much less when your data is unbalanced but still adequate where it is sparser. Suppose you have 1,000 examples of "q" and 10,000 of most other letters: that may not do much harm, because the smaller sample may already include most of the key information to be found in a much larger one. There are diminishing returns to having more examples once you pass the level that is adequate for good generalisation.
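
One rough way to see the diminishing return is the Hoeffding bound for a single fixed hypothesis: with probability at least 1 - delta, the gap between in-sample and out-of-sample error is at most sqrt(ln(2/delta) / (2N)) for N examples. Applying this bound here is an assumption on my part (it ignores the size of the hypothesis set), and delta = 0.05 and the sample sizes are only illustrative:

```python
import math

def hoeffding_epsilon(n_examples, delta=0.05):
    # Bound on |E_in - E_out| for one fixed hypothesis,
    # holding with probability at least 1 - delta.
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n_examples))

sample_sizes = [100, 1_000, 10_000, 100_000]
epsilons = [hoeffding_epsilon(n) for n in sample_sizes]

for n, eps in zip(sample_sizes, epsilons):
    print(f"N = {n:>7}: gap at most {eps:.4f}")

# Absolute improvement bought by each tenfold increase in N.
gains = [a - b for a, b in zip(epsilons, epsilons[1:])]
print("gains per tenfold increase:", [round(g, 4) for g in gains])
```

Each tenfold increase in N shrinks the bound by the same factor (the square root of 10), so the absolute improvement gets smaller every time.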