1. In the lecture example, we were cautioned against data snooping, since looking at the data means we may implicitly be doing some of the learning in our heads. My question: is it legitimate to look at DataSet 1 to identify my predictors, and then train on DataSet 2, whose samples are entirely different from those in DataSet 1? Of course, the out-of-sample error would be evaluated on DataSet 3, which is different from both 1 and 2.
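
To make the setup concrete, here is a minimal sketch of the three-way partition I have in mind. The function name `three_way_split`, the split fractions, and the seed are my own assumptions for illustration, not something from the lecture.

```python
import random

def three_way_split(n_samples, fractions=(0.2, 0.6, 0.2), seed=0):
    """Partition indices 0..n_samples-1 into three disjoint sets (hypothetical helper)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once with a fixed seed

    n_explore = round(fractions[0] * n_samples)
    n_train = round(fractions[1] * n_samples)

    explore = indices[:n_explore]                    # DataSet 1: looked at to pick predictors
    train = indices[n_explore:n_explore + n_train]   # DataSet 2: used for training only
    test = indices[n_explore + n_train:]             # DataSet 3: used for out-of-sample error only
    return explore, train, test

explore, train, test = three_way_split(100)
```

The point of the sketch is only that no sample index appears in more than one of the three sets.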

2. At the end of the lecture, somebody asked about multiclass classifiers, and the answer was that this is commonly done using either one-vs-all or one-vs-one training. My questions:

- 2-a) For one-vs-all, we only need to build n classifiers for n classes, whereas for one-vs-one we have to build n-choose-two classifiers, which can take much longer if we have many classes. Are there any inherent benefits to one-vs-one? If not, why do it at all, since one-vs-all is faster to train?

- 2-b) Are there any reasons why one method is preferable to the other? E.g., is there an impact on accuracy or generalization from choosing either approach?
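
To illustrate the counts in 2-a, here is a small sketch that only enumerates the binary problems each scheme would need; it does not train anything. The helper names and the class labels are made up for illustration.

```python
from itertools import combinations

def one_vs_all_tasks(classes):
    """One binary problem per class: that class against all the others."""
    return [(c, "rest") for c in classes]

def one_vs_one_tasks(classes):
    """One binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))

classes = ["a", "b", "c", "d", "e"]  # hypothetical labels, n = 5

ova = one_vs_all_tasks(classes)  # n problems
ovo = one_vs_one_tasks(classes)  # n-choose-two = n * (n - 1) / 2 problems

print(len(ova), len(ovo))
```

The gap between the two counts grows quadratically with the number of classes, which is what motivates the question.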

3. We used cross-entropy error for logistic regression and squared error for linear regression. It was explained that each error measure is chosen so that the math of the minimization becomes easy to implement. In both cases, the practical interpretation was explained and appears intuitive. My questions:

- 3-a) Does the choice of error measure affect the final choice of approximation? In other words, will we get a different **g** depending on whether we use cross-entropy, squared, or any other error function? (Ignore the complexity of the math with respect to minimization for now.)
- 3-b) If we optimize to find **g** using one error function, but evaluate using a different error function, will the evaluation be meaningful? E.g., use squared error to evaluate out-of-sample performance for a logistic model built by minimizing cross-entropy error.
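
To make 3-b concrete, here is a rough sketch of scoring the same logistic outputs under both error measures. It assumes labels in {-1, +1}, the cross-entropy form ln(1 + exp(-y s)) with signal s = w·x, and a squared error measured between the predicted probability and a 0/1 target. The signals and labels are invented numbers, not results from any model.

```python
import math

def logistic(s):
    """Logistic function mapping a signal s to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))

def cross_entropy_error(signals, labels):
    """Mean of ln(1 + exp(-y * s)) over the points, labels y in {-1, +1}."""
    total = sum(math.log(1.0 + math.exp(-y * s)) for s, y in zip(signals, labels))
    return total / len(labels)

def squared_error(signals, labels):
    """Mean squared gap between predicted P(y = +1) and the 0/1 target."""
    total = sum((logistic(s) - (1.0 if y == 1 else 0.0)) ** 2
                for s, y in zip(signals, labels))
    return total / len(labels)

# Made-up signals s = w.x and labels, purely for illustration.
signals = [2.0, -1.0, 0.5, -3.0]
labels = [1, -1, -1, 1]

print(cross_entropy_error(signals, labels))
print(squared_error(signals, labels))
```

Both functions score the same predictions, so the question is whether the second number is a meaningful yardstick for a **g** that was chosen by minimizing the first.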