09-19-2012, 11:33 AM
gah44
Invited Guest
Join Date: Jul 2012
Location: Seattle, WA
Posts: 153
Re: Data independence

Originally Posted by magdon:
We should distinguish between similar inputs and non-independent inputs. If I am trying to learn f(x) and I generate independent inputs x_1 and x_2 and they happen to be close to each other, i.e. similar, so x_1\approx x_2, then it will be no surprise that f(x_1)\approx f(x_2). This is like Spiderman 1 and Spiderman 2: these are similar inputs, and it is no surprise that the user rated them similarly. It is true that having two similar inputs may not be as useful for learning about f as dissimilar inputs would have been (as dissimilar inputs tell you about f on "more" of the input space).

Similar does not mean non-independent.

However, in the Netflix example, there are subtle problems that you may be alluding to. Think about how a user chooses movies to rent. They have their tastes so they have a tendency to select movies of a certain type. This is how the training data is generated.
Yes. But one might choose Spiderman 2 not because it is the type of movie that they like, but because it is a sequel to Spiderman 1. Maybe the two should count as more than one movie, but not quite as much as two independent movies would.

Now Netflix would like to learn to predict movie ratings for the viewer. However, if Netflix selects a movie at random and rates it for the viewer, then the test point is not from the same distribution as the training data. If, on the other hand, the viewer selected a movie and asked for a rating, then this test point is from the same distribution as the training data. So one must be careful.
Maybe it is more obvious with Facebook. FB will suggest a friend, saying we have three friends in common, but those three friends are all related to each other. Similar to the Spiderman case, where you watch a sequel just because it is a sequel, a sibling or child of a friend will also send a friend request.

Now, one way to account for this is to know in advance that two people are related, or that one movie is a sequel of another, but the dependence should also show up in the data itself.

Say, for example, that the data show that everyone who watched Spiderman 2 had also watched Spiderman 1, and, for the sake of this discussion, vice versa. It should, then, be completely obvious that there is no additional information from the fact that someone watched both. The combination should have weight 1.0 instead of 2.0 in any calculation. If not everyone watched both, but many did, then the weight should be between 1.0 and 2.0.
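To make the weighting idea concrete, here is a rough sketch in Python. The formula is only my assumption of one way to get a number between 1.0 and 2.0: it takes the fraction of viewers who watched both movies out of those who watched either one (the Jaccard overlap) and subtracts it from 2.0. The function name combined_weight and the viewer names are made up for illustration. Note also that zero overlap between viewer sets is not the same thing as statistical independence; it is just the case where this sketch gives the pair its full weight of 2.0.

```python
def combined_weight(viewers_a, viewers_b):
    """Weight for a pair of movies, between 1.0 and 2.0.

    Hypothetical formula (an assumption, not an established method):
    2.0 minus the Jaccard overlap of the two viewer sets.
    """
    a, b = set(viewers_a), set(viewers_b)
    union = a | b
    if not union:
        # No viewers at all: no evidence of overlap, so give full weight.
        return 2.0
    overlap = len(a & b) / len(union)  # 1.0 when the sets are identical
    return 2.0 - overlap


# Everyone who watched one also watched the other: the pair counts once.
spiderman1 = {"ann", "bob", "cat", "dan"}
spiderman2 = {"ann", "bob", "cat", "dan"}
print(combined_weight(spiderman1, spiderman2))  # 1.0

# Many, but not all, watched both: the weight lands between 1.0 and 2.0.
sequel_partial = {"ann", "bob", "eve"}
print(combined_weight(spiderman1, sequel_partial))

# No shared viewers: the pair gets the full weight of two movies.
print(combined_weight({"ann"}, {"bob"}))  # 2.0
```

Other overlap measures would work just as well here. The only properties taken from the argument above are the two endpoints (1.0 when everyone watched both) and the requirement that partial overlap fall somewhere in between.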