LFD Book Forum: Data independence
#1
09-18-2012, 01:12 PM
 gah44 Invited Guest Join Date: Jul 2012 Location: Seattle, WA Posts: 153
Data independence

I was recently thinking about the Facebook friend-suggestion algorithm,
though I think the same problem could also apply to Netflix.

The assumption is that data points are independent, and so contribute equally to the solution.

In the FB case, if I am friends with more than one person in a family, it has a strong tendency to suggest other friends of the family, stronger than it should. (Though FB doesn't necessarily know that they are related.)

In the Netflix case, if someone likes Spiderman 1, Spiderman 2, and Spiderman 3, that really isn't three independent samples. On the other hand, Spiderman 1 and Batman 1 should be considered more independent.

It seems to me that there should be enough in the data to extract some of this dependence.
#2
09-18-2012, 07:29 PM
 magdon RPI Join Date: Aug 2009 Location: Troy, NY, USA. Posts: 595
Re: Data independence

 We should distinguish between similar inputs and non-independent inputs. If I am trying to learn $f$ and I generate independent inputs $x_1$ and $x_2$, and they happen to be close to each other, i.e. similar, so $x_1 \approx x_2$, then it will be no surprise that $f(x_1) \approx f(x_2)$. This is like Spiderman 1 and Spiderman 2. These are similar inputs and it is no surprise that the user rated them similarly. It is true that having two similar inputs may not be as useful for learning about $f$ as dissimilar inputs would have been (as dissimilar inputs tell you about $f$ on "more" of the input space).

Similar does not mean non-independent.

However, in the Netflix example, there are subtle problems that you may be alluding to. Think about how a user chooses movies to rent. They have their tastes, so they tend to select movies of a certain type. This is how the training data is generated. Now Netflix would like to learn to predict movie ratings for the viewer. However, if Netflix selects a movie at random and predicts its rating for the viewer, then the test point is not from the same distribution as the training data. If, on the other hand, the viewer selected a movie and asked for a rating, then this test point is from the same distribution as the training data. So one must be careful.
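The training/test mismatch described above can be simulated in a few lines. Everything in this sketch is an invented toy setup (the one-dimensional "type" axis, the taste distribution, and the rating function are illustrative assumptions, not anything from the thread); it only shows that a model fit on taste-driven samples looks fine on test points drawn the same way, but much worse on uniformly chosen ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: movies live on a 1-D "type" axis in [0, 1], and the
# viewer's true rating is a nonlinear function of type.
def true_rating(x):
    return np.sin(3 * x)

# Training inputs follow the viewer's taste: concentrated near type 0.2.
x_train = np.clip(rng.normal(0.2, 0.1, 200), 0, 1)
y_train = true_rating(x_train) + rng.normal(0, 0.1, 200)

# Fit a simple linear model to the observed ratings.
w = np.polyfit(x_train, y_train, deg=1)

def mse(x):
    """Mean squared prediction error on test inputs x."""
    return np.mean((np.polyval(w, x) - true_rating(x)) ** 2)

# Test points chosen the way the viewer chooses (same distribution) ...
x_same = np.clip(rng.normal(0.2, 0.1, 10000), 0, 1)
# ... versus movies picked uniformly at random (different distribution).
x_uniform = rng.uniform(0, 1, 10000)

# The model extrapolates poorly outside the viewer's taste region,
# so the error on the mismatched test distribution is much larger.
print(mse(x_same), mse(x_uniform))
```

The point is not the particular model; any learner fit on taste-selected data is only guaranteed (by the usual theory) to generalize to test points drawn from that same distribution.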

Quote:
 Originally Posted by gah44 I was recently thinking about the Facebook friend suggesting algorithm, though I think that the problem could also apply to Netflix. The assumption is that data points are independent, and so contribute equally to the solution. In the FB case, if I am friends with more than one person in a family, it has a strong tendency to suggest other friends of the family, stronger than it should. (Though FB doesn't necessarily know that they are related.) In the Netflix case, if someone likes Spiderman 1, Spiderman 2, and Spiderman 3, that really isn't three independent samples. On the other hand, Spiderman 1 and Batman 1 should be considered more independent. It seems to me that there should be enough in the data to extract some of this dependence.
__________________
Have faith in probability
#3
09-19-2012, 11:33 AM
 gah44 Invited Guest Join Date: Jul 2012 Location: Seattle, WA Posts: 153
Re: Data independence

Quote:
 Originally Posted by magdon We should distinguish between similar inputs and non-independent inputs. If I am trying to learn $f$ and I generate independent inputs $x_1$ and $x_2$, and they happen to be close to each other, i.e. similar, so $x_1 \approx x_2$, then it will be no surprise that $f(x_1) \approx f(x_2)$. This is like Spiderman 1 and Spiderman 2. These are similar inputs and it is no surprise that the user rated them similarly. It is true that having two similar inputs may not be as useful for learning about $f$ as dissimilar inputs would have been (as dissimilar inputs tell you about $f$ on "more" of the input space). Similar does not mean non-independent. However, in the Netflix example, there are subtle problems that you may be alluding to. Think about how a user chooses movies to rent. They have their tastes so they have a tendency to select movies of a certain type. This is how the training data is generated.
Yes. But one might choose Spiderman 2 not because it is the type of movie they like, but because it is a sequel to Spiderman 1. Maybe the two should count as more than one movie, but not quite as much as two independent movies would.

Quote:
 Now Netflix would like to learn to predict movie rating for the viewer. However, if Netflix selects a movie at random and rates it for the viewer, then the test point is not from the same distribution as the training data. If, on the other hand, the viewer selected a movie and asked for a rating, then this test point is from the same distribution as the training data. So one must be careful.
Maybe it is more obvious with Facebook. FB will suggest friends, saying we have three friends in common, but those friends are all related to each other. Similar to the Spiderman case, where you watch a sequel just because it is a sequel, a sibling or child of a friend will also send a friend request.

Now, one way to account for this is to know explicitly that two people are related, or that one movie is a sequel of another, but the dependence should also be visible in the data itself.

Say, for example, that the data show that everyone who watched Spiderman 2 had also watched Spiderman 1, and, for the sake of this discussion, vice versa. It should, then, be completely obvious that there is no additional information from the fact that someone watched both. The combination should have weight 1.0 instead of 2.0 in any calculation. If not everyone watched both, but many did, then the weight should be between 1.0 and 2.0.
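The weighting idea above can be made concrete. The sketch below is one hypothetical way to do it, using the Jaccard overlap of the two movies' audiences (that choice of overlap measure is illustrative, not something proposed in the thread): identical audiences give the pair weight 1.0, disjoint audiences give 2.0, and partial overlap falls in between.

```python
def pair_weight(viewers_a, viewers_b):
    """Combined weight for counting two movies together.

    Returns 2.0 if the audiences never overlap (two independent samples),
    1.0 if the audiences are identical (no extra information from the
    second movie), and a value in between for partial overlap.
    Jaccard overlap is an illustrative choice, not a canonical one.
    """
    a, b = set(viewers_a), set(viewers_b)
    jaccard = len(a & b) / len(a | b)
    return 2.0 - jaccard

# Everyone who watched Spiderman 2 also watched Spiderman 1, and vice versa:
print(pair_weight({1, 2, 3}, {1, 2, 3}))  # 1.0: no additional information
# Disjoint audiences: the two viewings count as two independent samples.
print(pair_weight({1, 2}, {3, 4}))        # 2.0
# Partial overlap: weight strictly between 1.0 and 2.0.
print(pair_weight({1, 2, 3}, {2, 3, 4}))  # 1.5
```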
#4
09-19-2012, 04:57 PM
 magdon RPI Join Date: Aug 2009 Location: Troy, NY, USA. Posts: 595
Re: Data independence

Yes, if you select Spiderman 2 because you first selected Spiderman 1, then this is indeed non-independent sampling, which is even worse than just having a mismatch between the training and test probability distributions. When sampling is non-independent, there are "effectively fewer" data points than the raw count suggests.
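The "effectively fewer data points" remark has a standard quantitative form: for $n$ samples with equal pairwise correlation $\rho$, the variance of the sample mean equals that of $n_{\text{eff}} = n / (1 + (n-1)\rho)$ independent samples. A small sketch (the equicorrelation model is an illustrative assumption, not something stated in the thread):

```python
def effective_sample_size(n, rho):
    """Effective number of independent samples for n equally correlated
    samples with pairwise correlation rho (standard equicorrelation result):
    the sample mean of the n correlated points has the same variance as
    the mean of n_eff independent points."""
    return n / (1 + (n - 1) * rho)

print(effective_sample_size(3, 0.0))  # 3.0: independent, full worth
print(effective_sample_size(3, 1.0))  # 1.0: perfectly correlated "sequels"
print(effective_sample_size(3, 0.5))  # 1.5: partially dependent samples
```

Under this model, three perfectly correlated viewings (Spiderman 1, 2, 3 chosen as a chain) carry the information of a single sample, matching the intuition earlier in the thread.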

Quote:
 Originally Posted by gah44 Yes. But one might choose Spiderman 2 not because it is the type of movie they like, but because it is a sequel to Spiderman 1. Maybe the two should count as more than one movie, but not quite as much as two independent movies would.
__________________
Have faith in probability
