Quote:
Originally Posted by gah44
Well, it is also that the two-point data set is small relative to the two-parameter hypotheses. If you have 100 points and 99th-degree polynomials, it would also have large variance. My guess is that bias plus variance is minimized when the number of fit parameters is near the square root of the number of points per data set.

Large variance, sure. I was trying to understand why the bias would be large. If you take a huge number of 100-point datasets, learn a hypothesis from each, and average those hypotheses, why might the average be far from the target function's value?
On the other hand, I'm not sure how to prove that it won't be far.
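
One way to probe this numerically is to run the experiment directly. Below is a minimal sketch, not a proof. It assumes a target of sin(pi x) on [-1, 1], noiseless data, uniformly sampled inputs, and a least-squares polynomial fit; the function name bias_variance and all of its parameters are made up for illustration. It averages the learned hypotheses over many datasets and compares that average to the target.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def bias_variance(n_points, degree, n_datasets=2000, seed=0):
    """Estimate bias and variance by simulation (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    target = lambda x: np.sin(np.pi * x)   # assumed target function
    x_test = np.linspace(-1.0, 1.0, 201)   # grid used to approximate E_x[...]
    preds = np.empty((n_datasets, x_test.size))

    for i in range(n_datasets):
        x = rng.uniform(-1.0, 1.0, n_points)   # one random dataset
        y = target(x)                          # noiseless labels (assumption)
        coeffs = P.polyfit(x, y, degree)       # least-squares polynomial fit
        preds[i] = P.polyval(x_test, coeffs)   # this dataset's hypothesis

    g_bar = preds.mean(axis=0)                      # average hypothesis
    bias = np.mean((g_bar - target(x_test)) ** 2)   # squared gap to target
    var = np.mean(preds.var(axis=0))                # spread around g_bar
    return bias, var

if __name__ == "__main__":
    bias, var = bias_variance(n_points=2, degree=1)
    print(f"bias ~ {bias:.3f}, variance ~ {var:.3f}")
```

Note that with only finitely many datasets, the estimated bias also picks up a small contribution of roughly variance / n_datasets, so a huge variance can masquerade as bias unless n_datasets is very large. Also, fitting a 99th-degree polynomial in the plain monomial basis is numerically ill-conditioned, so for the 100-point case the numbers from this sketch should be treated with suspicion; smaller n_points and degree are safer to experiment with.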