LFD Book Forum Isn't the bin (your data set) the sample?

#1
01-12-2013, 02:39 PM
 ArikB Junior Member Join Date: Oct 2012 Posts: 8
Isn't the bin (your data set) the sample?

This has me a bit confused: isn't the bin your data set in the analogy? In that case, your data set is a sample of the population. For instance, in the bank example, your data set would be the sample and the population would be all of the possible people applying for credit.

If that is the case, then how does Hoeffding tell us anything about what is truly out of sample?

Or am I confused, and the bin is really the population? In that case mu is the population fraction, and the marbles you pick from the bin represent the data set?

Perhaps I should rephrase this a bit more systematically:

In the best-case scenario my bin is completely green, i.e. my hypothesis agrees entirely with my data set, so mu is 1. Hoeffding gives me a probabilistic bound on how well nu approximates this mu (which is 1). That's nice, but now I only know that I have a hypothesis that agrees entirely with my data set. How does this generalize beyond my data set? Or am I getting ahead of myself, and the lecture is not about this? If so, why use nu at all? If this is a case of supervised learning and I know the outputs, then I can see mu immediately, because I can check directly whether my hypothesis agrees with the outputs on my data set.
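For reference, the bound in question (the standard single-hypothesis form of Hoeffding's inequality, with sample size N and tolerance epsilon) is:

\[
P\big[\,|\nu - \mu| > \epsilon\,\big] \;\le\; 2\,e^{-2\epsilon^{2} N}
\]

Here nu is the fraction observed in the sample and mu is the fraction in the whole bin.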

Or is it that the bin represents my training set, and I have (supposedly) already subdivided my data set into a training set and data that I hold out for testing?
#2
01-12-2013, 06:08 PM
 butterscotch Caltech Join Date: Jan 2013 Posts: 43
Re: Isn't the bin (your data set) the sample?

mu denotes the probability of green over the entire space, including points outside of D.

In the marbles-in-a-bin example, the bin is the entire space, and the N marbles you pick are your data set; i.e., you do not know the colors of the rest of the marbles in the bin.

Consider the following example. There are 10000 marbles in the bin, and you want to know the proportion of red marbles (call it mu). You could figure out the exact ratio by taking out all the marbles and counting them. But say you have to figure this out in limited time, and can only afford to look at 100 marbles. You count 30 red marbles and 70 black marbles, so nu is 0.3. You do not know whether the marbles outside of your sample agree with this fraction, but the Hoeffding Inequality provides a probabilistic bound on mu based on nu.
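The 10000-marble example above can be sketched in a few lines of Python. The true red proportion (30%) and the tolerance eps = 0.1 are assumptions chosen for illustration:

```python
import math
import random

random.seed(0)

# Hypothetical bin: 10000 marbles, 30% of them red (so the true mu = 0.3).
N_BIN, MU = 10000, 0.3
bin_marbles = ["red"] * int(N_BIN * MU) + ["black"] * int(N_BIN * (1 - MU))

# Draw a sample of N = 100 marbles without looking at the rest of the bin.
N = 100
sample = random.sample(bin_marbles, N)
nu = sample.count("red") / N  # fraction of red marbles in the sample

# Hoeffding: P[|nu - mu| > eps] <= 2 * exp(-2 * eps^2 * N)
eps = 0.1
bound = 2 * math.exp(-2 * eps**2 * N)
print(f"nu = {nu:.2f}  (true mu = {MU})")
print(f"P[|nu - mu| > {eps}] <= {bound:.3f}")  # bound is 2e^-2, about 0.271
```

Note that the bound depends only on eps and N, not on mu: you can compute it without ever seeing the rest of the bin, which is exactly why it says something about what is out of sample.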
#3
01-14-2013, 12:01 PM
 ArikB Junior Member Join Date: Oct 2012 Posts: 8
Re: Isn't the bin (your data set) the sample?

Quote:
 Originally Posted by butterscotch mu denotes probability of green in the entire space, outside of D included. [...] But Hoeffding Inequality provides a bound for the probability for values of mu based on v.
Thanks.
