LFD Book Forum Questions on Lecture 9 (Linear Models II)
#1
08-09-2012, 05:45 AM
 hashable Junior Member Join Date: Jul 2012 Posts: 8
Questions on Lecture 9 (Linear Models II)

1. In the example in the lecture, we were cautioned against data snooping since looking at data can mean that we can be implicitly doing some learning in our head. My question is: Is it legitimate to look at DataSet 1 to identify my predictors, and then train on DataSet 2 with samples entirely different from DataSet 1? Of course, the out of sample error will be evaluated on DataSet 3 different from 1 and 2.

2. At the end of the lecture, somebody asked a question about multiclass classifiers and it was answered that it is commonly done using either one-vs-all training or one-vs-one training. My questions:
• 2-a) For one-versus-all, we only need to build n classifiers for n classes, whereas for one-versus-one we have to build n-choose-two classifiers, which can take much longer if we have many classes. Are there any inherent benefits to one-vs-one? If not, why do it at all, since one-vs-all is faster to train?
• 2-b) Are there any reasons why one method is preferable to the other? E.g., is there an impact on accuracy/generalization from choosing either approach?

3. We used cross-entropy error for logistic regression and squared error for linear regression. It was explained that the error measure is chosen so that the math of the minimization becomes easy to implement. In both cases, the practical interpretation was explained and it appears intuitive. My questions:
• 3-a) Does the choice of error measure affect the final choice of approximation? In other words, will we get a different g depending on whether we use linear or squared or any other error function? (Ignore the complexity of the math with respect to minimization for now.)
• 3-b) If we optimize to find g using one error function, but evaluate using a different error function, will the evaluation be meaningful? E.g., use squared error to evaluate out-of-sample performance for a logistic model built by minimizing cross-entropy error.
#2
08-09-2012, 06:19 AM
 yaser Caltech Join Date: Aug 2009 Location: Pasadena, California, USA Posts: 1,478
Re: Questions on Lecture 9 (Linear Models II)

Quote:
 Originally Posted by hashable 1. In the example in the lecture, we were cautioned against data snooping since looking at data can mean that we can be implicitly doing some learning in our head. My question is: Is it legitimate to look at DataSet 1 to identify my predictors, and then train on DataSet 2 with samples entirely different from DataSet 1? Of course, the out of sample error will be evaluated on DataSet 3 different from 1 and 2.
Yes, this is legitimate.
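
A minimal sketch of the three-set arrangement described in the question, assuming the samples can simply be indexed and shuffled. The function name `three_way_split` and the 20/60/20 fractions are hypothetical choices for illustration, not something from the lecture.

```python
import random

def three_way_split(n_samples, fractions=(0.2, 0.6, 0.2), seed=0):
    """Partition indices 0..n_samples-1 into three disjoint lists.

    explore: looked at by a human to pick predictors (DataSet 1)
    train:   used to fit the model (DataSet 2)
    test:    used only to estimate out-of-sample error (DataSet 3)
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_explore = int(fractions[0] * n_samples)
    n_train = int(fractions[1] * n_samples)
    explore = indices[:n_explore]
    train = indices[n_explore:n_explore + n_train]
    test = indices[n_explore + n_train:]  # whatever remains
    return explore, train, test

explore, train, test = three_way_split(100)
print(len(explore), len(train), len(test))
```

The point of the sketch is only that no index appears in more than one list, so nothing learned by looking at the first set leaks into the data used for training or testing.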

Quote:
 2. At the end of the lecture, somebody asked a question about multiclass classifiers and it was answered that it is commonly done using either one-vs-all training or one-vs-one training. My questions: 2-a) For one-versus-all, we only need to build n classifiers for n classes, whereas for one-versus-one we have to build n-choose-two classifiers, which can take much longer if we have many classes. Are there any inherent benefits to one-vs-one? If not, why do it at all, since one-vs-all is faster to train? 2-b) Are there any reasons why one method is preferable to the other? E.g., is there an impact on accuracy/generalization from choosing either approach?
There is a significant body of work on multiclass in machine learning that you can explore in the open literature, and considerations of generalization and computation are key issues as you mentioned. The answer in the lecture addressed one-versus-one and one-versus-all because of their conceptual simplicity.
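
To make the classifier counts from the question concrete, here is a small sketch that enumerates the binary problems each scheme would train. The helper names `one_vs_all_tasks` and `one_vs_one_tasks` are made up for this illustration, and the string "rest" is just a placeholder for "all other classes".

```python
from itertools import combinations
from math import comb

def one_vs_all_tasks(classes):
    """One binary problem per class: that class against everything else."""
    return [(c, "rest") for c in classes]

def one_vs_one_tasks(classes):
    """One binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))

digits = list(range(10))
print(len(one_vs_all_tasks(digits)))   # n problems
print(len(one_vs_one_tasks(digits)))   # n-choose-two problems
print(comb(len(digits), 2))            # same count, computed directly
```

One thing the counts alone do not show: each one-vs-one problem is trained only on the samples of its two classes, whereas each one-vs-all problem uses the whole data set, so the per-classifier cost differs between the two schemes.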

Quote:
 3. We used cross-entropy error for logistic regression and squared error for linear regression. It was explained that the error measure is chosen so that the math of the minimization becomes easy to implement. In both cases, the practical interpretation was explained and it appears intuitive. My questions: 3-a) Does the choice of error measure affect the final choice of approximation? In other words, will we get a different g depending on whether we use linear or squared or any other error function? (Ignore the complexity of the math with respect to minimization for now.) 3-b) If we optimize to find g using one error function, but evaluate using a different error function, will the evaluation be meaningful? E.g., use squared error to evaluate out-of-sample performance for a logistic model built by minimizing cross-entropy error.
The choice of error measure does affect the final hypothesis, and you can certainly evaluate different error measures on the same hypothesis. It is meaningful in the sense that it does measure the error in a particular way, but it may be hard to interpret the errors when they come from different measures.
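
As a rough illustration of both points, the sketch below fits the same linear signal w0 + w1*x to a tiny made-up data set twice, once by minimizing squared error and once by minimizing cross-entropy error, and then scores each result under both measures. The data values, step count, and learning rate are arbitrary assumptions for the example; the cross-entropy form ln(1 + exp(-y s)) with labels in {-1, +1} is the one used for logistic regression.

```python
import math

# Toy 1-D data with labels in {-1, +1}; the values are illustrative only.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0]
ys = [-1, -1, 1, -1, 1, 1, 1]

def signal(w, x):
    return w[0] + w[1] * x

def squared_error(w):
    return sum((signal(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cross_entropy_error(w):
    return sum(math.log1p(math.exp(-y * signal(w, x)))
               for x, y in zip(xs, ys)) / len(xs)

def fit_least_squares():
    """Closed-form minimizer of squared_error for a 1-D input."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w1 = sxy / sxx
    return (y_mean - w1 * x_mean, w1)

def fit_cross_entropy(steps=5000, rate=0.1):
    """Batch gradient descent on cross_entropy_error, starting from zero."""
    w0, w1 = 0.0, 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            # d/ds ln(1 + exp(-y s)) = -y / (1 + exp(y s))
            coeff = -y / (1.0 + math.exp(y * (w0 + w1 * x)))
            g0 += coeff
            g1 += coeff * x
        w0 -= rate * g0 / len(xs)
        w1 -= rate * g1 / len(xs)
    return (w0, w1)

w_lin = fit_least_squares()
w_log = fit_cross_entropy()
for name, w in (("least squares", w_lin), ("cross entropy", w_log)):
    print(name, w, squared_error(w), cross_entropy_error(w))
```

The two fits give different weights, and each one does best under the measure it was trained on. Both numbers can be computed for either hypothesis, but they are on different scales, which is why comparing across measures is hard to interpret.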
__________________
Where everyone thinks alike, no one thinks very much
#3
08-09-2012, 11:26 PM
 gah44 Invited Guest Join Date: Jul 2012 Location: Seattle, WA Posts: 153
Re: Questions on Lecture 9 (Linear Models II)

Quote:
 Originally Posted by hashable (snip) 3. We used cross-entropy error for logistic regression and squared error for linear regression. It was explained that the error measure is chosen so that the math of the minimization becomes easy to implement. In both cases, the practical interpretation was explained and it appears intuitive. My questions: 3-a) Does the choice of error measure affect the final choice of approximation? In other words, will we get a different g depending on whether we use linear or squared or any other error function? (Ignore the complexity of the math with respect to minimization for now.) 3-b) If we optimize to find g using one error function, but evaluate using a different error function, will the evaluation be meaningful? E.g., use squared error to evaluate out-of-sample performance for a logistic model built by minimizing cross-entropy error.

Statisticians don't like squared error much. Minimizing the sum of absolute values of the differences, instead of the sum of their squares, tends to give results that are more robust to outliers, but the math is harder because there is generally no closed-form solution. Least squares, for example, can be thrown off by a single outlier.
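
The simplest case of this is fitting a single constant to a list of numbers: the mean minimizes the sum of squared differences, and the median minimizes the sum of absolute differences. A quick sketch with made-up numbers shows how differently the two react to one outlier.

```python
from statistics import mean, median

def sum_squared(c, values):
    return sum((v - c) ** 2 for v in values)

def sum_absolute(c, values):
    return sum(abs(v - c) for v in values)

clean = [1.0, 2.0, 3.0, 4.0, 5.0]
with_outlier = clean + [100.0]   # one wild point, chosen for illustration

for label, values in (("clean", clean), ("with outlier", with_outlier)):
    print(label, "mean:", mean(values), "median:", median(values))
```

On the clean list the mean and median agree. After adding the single point at 100, the mean jumps to about 19.2 while the median only moves to 3.5, which is the sensitivity being described.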


All times are GMT -7.
