09-18-2012, 07:29 PM
magdon
Join Date: Aug 2009
Location: Troy, NY, USA.
Posts: 597
Re: Data independence

We should distinguish between similar inputs and non-independent inputs. If I am trying to learn f(x) and I generate independent inputs x_1 and x_2 that happen to be close to each other, i.e. similar, so x_1\approx x_2, then it is no surprise that f(x_1)\approx f(x_2) for a reasonably smooth f. This is like Spiderman 1 and Spiderman 2: they are similar inputs, and it is no surprise that the user rated them similarly. It is true that two similar inputs may be less useful for learning about f than two dissimilar inputs would have been, since dissimilar inputs tell you about f on "more" of the input space.

Similar does not mean non-independent.
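
To spell the distinction out, here is a short formal statement. The notation (the input distribution P, the distance threshold \epsilon) is my own shorthand for illustration, not something defined earlier in the thread.

```latex
% Independence is a statement about the joint distribution of the inputs:
P(x_1, x_2) = P(x_1)\, P(x_2)

% Similarity is a statement about the realized values of one particular draw:
\| x_1 - x_2 \| \le \epsilon

% The two are compatible: for independent draws and any \epsilon > 0,
\Pr\big[ \| x_1 - x_2 \| \le \epsilon \big] > 0
```

So a pair of close inputs is simply one of the outcomes that independent sampling can produce; closeness alone says nothing about whether the sampling was independent.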

However, in the Netflix example, there are subtle problems that you may be alluding to. Think about how a user chooses movies to rent. Users have their own tastes, so they tend to select movies of a certain type. This is how the training data is generated. Now Netflix would like to learn to predict movie ratings for the viewer. However, if Netflix selects a movie at random and predicts its rating for the viewer, then the test point is not from the same distribution as the training data. If, on the other hand, the viewer selects a movie and asks for a predicted rating, then this test point is from the same distribution as the training data. So one must be careful.
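
The mismatch described above can be seen in a small simulation. This is only a sketch under made-up assumptions: the three genres, the selection probabilities, and the per-genre ratings are all hypothetical numbers chosen for illustration, and nothing here reflects how Netflix actually works.

```python
import random

def sample_genres(n, weights, rng):
    """Draw n genre labels independently according to the selection weights."""
    genres = list(weights)
    return rng.choices(genres, weights=[weights[g] for g in genres], k=n)

def mean_rating(genres, rating_by_genre):
    """Average rating the viewer would give to the sampled movies."""
    return sum(rating_by_genre[g] for g in genres) / len(genres)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical viewer: mostly picks action movies and rates them highest.
viewer_choice = {"action": 0.80, "drama": 0.15, "documentary": 0.05}
random_choice = {"action": 1 / 3, "drama": 1 / 3, "documentary": 1 / 3}
rating_by_genre = {"action": 4.5, "drama": 3.0, "documentary": 2.0}

n = 10_000
train = sample_genres(n, viewer_choice, rng)        # viewer-selected rentals
test_viewer = sample_genres(n, viewer_choice, rng)  # viewer picks the test movie
test_random = sample_genres(n, random_choice, rng)  # movie picked at random

train_mean = mean_rating(train, rating_by_genre)
test_viewer_mean = mean_rating(test_viewer, rating_by_genre)
test_random_mean = mean_rating(test_random, rating_by_genre)

print(f"training data (viewer-selected): {train_mean:.2f}")
print(f"test, viewer-selected movie:     {test_viewer_mean:.2f}")
print(f"test, randomly selected movie:   {test_random_mean:.2f}")
```

In this toy setup every draw is independent, yet the randomly selected test movies have a noticeably lower average rating than the training data, because they come from a different input distribution. The viewer-selected test movies match the training data closely.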

Originally Posted by gah44
I was recently thinking about the Facebook friend suggesting algorithm,
though I think that the problem could also apply to Netflix.

The assumption is that data points are independent, and so contribute equally to the solution.

In the FB case, if I am friends with more than one person in a family, it has a strong tendency to suggest other friends of the family, stronger than it should. (Though FB doesn't necessarily know that they are related.)

In the Netflix case, if someone likes Spiderman 1, Spiderman 2, and Spiderman 3, that really isn't three independent samples. On the other hand, Spiderman 1 and Batman 1 should be considered more independent.

It seems to me that there should be enough in the data to extract some of this dependence.