LFD Book Forum  

05-28-2013, 08:32 AM
kartikeya_t@yahoo.com
Member
 
Join Date: Jul 2012
Posts: 17
Under-represented class data

Hello all,
This topic might have been discussed in some earlier posts, in which case I apologize for the repetition.
One characteristic I repeatedly see in data sets is that some important class in a multi-class problem is under-represented. In other words, the data set has only a few points belonging to that class.
From Prof. Abu-Mostafa's lectures, I understand that there is an underlying probability distribution P on the input space X from which the samples are drawn, but that this distribution is only a mathematical requirement of the Hoeffding and VC framework. If I understand correctly, it is not our objective to find or learn about this distribution in a typical machine learning problem. And yet, this distribution can cause the kind of under-representation I described above.
My question is: does this under-representation have the potential to influence generalization? I feel that it must, but I am not sure how to quantify it. What are the ways to overcome this problem, given a data set one can't change? Should one just wait for the data set to acquire more equitable class representation over time? In fact, is the act of checking class representation an act of snooping in the first place?
I would appreciate any pointers or references on this matter.
Thank you.
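
One rough way to put a number on the question above is to apply the Hoeffding inequality mentioned in the post to each class separately. The sketch below is only an illustration under stated assumptions: it considers a single fixed hypothesis, treats the points of each class as i.i.d. draws from that class's conditional distribution, and uses made-up class counts. The function names `hoeffding_bound` and `samples_needed` are hypothetical helpers, not something from the lectures or the book.

```python
import math


def hoeffding_bound(n, eps):
    """Hoeffding bound on P(|E_in - E_out| > eps) for one fixed hypothesis
    evaluated on n i.i.d. points. Capped at 1 because it bounds a probability."""
    return min(1.0, 2.0 * math.exp(-2.0 * eps ** 2 * n))


def samples_needed(eps, delta):
    """Smallest n for which 2 * exp(-2 * eps^2 * n) <= delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


# Hypothetical class counts, chosen only to illustrate an imbalanced data set.
class_counts = {"majority": 9500, "minority": 50}
eps = 0.1  # tolerance on the per-class error estimate

for name, n in class_counts.items():
    print(f"{name}: n = {n}, bound = {hoeffding_bound(n, eps):.3g}")

print("points needed for eps = 0.1, delta = 0.05:", samples_needed(0.1, 0.05))
```

With these made-up counts the bound for the minority class is close to 1, so under these assumptions the in-sample error measured on that class alone says very little about its out-of-sample error, while the bound for the majority class is negligible.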
All times are GMT -7.
The contents of this forum are to be used ONLY by readers of the Learning From Data book by Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, and participants in the Learning From Data MOOC by Yaser S. Abu-Mostafa. No part of these contents is to be communicated or made accessible to ANY other person or entity.