LFD Book Forum

LFD Book Forum (http://book.caltech.edu/bookforum/index.php)
-   Chapter 3 - The Linear Model (http://book.caltech.edu/bookforum/forumdisplay.php?f=110)
-   -   Exercise 3.4 (http://book.caltech.edu/bookforum/showthread.php?t=4484)

tomaci_necmi 05-30-2014 05:21 AM

Exercise 3.4
 
I asked this question on Math Stack Exchange,

http://math.stackexchange.com/questi...to-a-dataset-d


Can anyone explain to me how it works?

tomaci_necmi 05-30-2014 05:22 AM

Re: Exercise 3.4
 
In my textbook, there is a statement on the topic of linear regression / machine learning, and a question, which I quote below:

Consider a noisy target, y = (w^{*})^T \textbf{x} + \epsilon, for generating the data, where \epsilon is a noise term with zero mean and \sigma^2 variance, independently generated for every example (\textbf{x},y). The expected error of the best possible linear fit to this target is thus \sigma^2.

For the data D =  \{ (\textbf{x}_1,y_1), ..., (\textbf{x}_N,y_N)  \}, denote the noise in y_n as \epsilon_n, and let \mathbf{\epsilon} = [\epsilon_1, \epsilon_2, ..., \epsilon_N]^T; assume that X^TX is invertible. By following the steps below, ***show that the expected in-sample error of linear regression with respect to D is given by***,

\mathbb{E}_D[E_{in}( \textbf{w}_{lin} )] = \sigma^2 (1 - \frac{d+1}{N})


Below is my methodology,


The book says that the in-sample error vector, \hat{\textbf{y}} - \textbf{y}, can be expressed as (H-I)\epsilon, where H = X(X^TX)^{-1}X^T is the hat matrix and \epsilon is the noise vector.

So, I calculated the in-sample error, E_{in}( \textbf{w}_{lin} ), as,

E_{in}( \textbf{w}_{lin} ) = \frac{1}{N}(\hat{\textbf{y}} - \textbf{y})^T (\hat{\textbf{y}} - \textbf{y}) =  \frac{1}{N}  (\epsilon^T (H-I)^T (H-I) \epsilon)

Since it is given by the book that,

(I-H)^K = (I-H) for any positive integer K, (I-H) is symmetric, and trace(H) = d+1

I got the following simplified expression,

E_{in}( \textbf{w}_{lin} ) =\frac{1}{N}  (\epsilon^T (H-I)^T (H-I) \epsilon) = \frac{1}{N} \epsilon^T (I-H) \epsilon = \frac{1}{N} \epsilon^T \epsilon - \frac{1}{N} \epsilon^T H \epsilon
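
To spell out that simplification (it uses only the identities quoted above):

(H-I)^T (H-I) = (I-H)^T (I-H) = (I-H)(I-H) = (I-H)^2 = (I-H)

since (I-H) is symmetric and (I-H)^K = (I-H).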


Here, I see that,

\mathbb{E}_D[\frac{1}{N} \epsilon^T \epsilon] = \frac {N \sigma^2}{N} = \sigma^2

Also, the quadratic form - \frac{1}{N} \epsilon^T H \epsilon expands into the following sum,

- \frac{1}{N} \epsilon^T H \epsilon = - \frac{1}{N} \{ \sum_{i=1}^{N} H_{ii} \epsilon_i^2 + \sum_{i \neq j} H_{ij} \ \epsilon_i \ \epsilon_j \}

I understand that,

- \frac{1}{N} \mathbb{E}_D[\sum_{i=1}^{N} H_{ii} \epsilon_i^2] = - trace(H) \ \sigma^2 = - (d+1) \ \sigma^2


However, I don't understand why,

- \frac{1}{N} \mathbb{E}_D[\sum_{i \neq j} H_{ij} \ \epsilon_i \ \epsilon_j ] = 0 \ \ \ \ \ \ \ \ \ \ \ \ (eq \ 1)


(eq 1) should be equal to zero in order to satisfy the equation,



\mathbb{E}_D[E_{in}( \textbf{w}_{lin} )] = \sigma^2 (1 - \frac{d+1}{N})


***Can anyone explain why (eq 1) equals zero?***

tomaci_necmi 05-30-2014 05:49 AM

Re: Exercise 3.4
 
Well, it all fits now; my mind was just playing tricks on me.

Of course, \mathbb{E}[\epsilon_i] = 0.
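
Spelling it out: since the noise terms are generated independently for every example (and H depends only on the inputs, not on the noise), for i \neq j,

\mathbb{E}_D[ H_{ij} \ \epsilon_i \ \epsilon_j ] = \mathbb{E}[H_{ij}] \ \mathbb{E}[\epsilon_i] \ \mathbb{E}[\epsilon_j] = 0

so every cross term in (eq 1) has zero expectation.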

yaser 05-30-2014 11:46 AM

Re: Exercise 3.4
 
Thank you for the question and the answer.

yongxien 08-03-2015 01:33 AM

Re: Exercise 3.4
 
Why is the last statement 0? I don't quite understand. Does the mean being zero imply E(\epsilon_i) = 0 and E(\epsilon_j) = 0? I find it weird if that is the case, because it would mean E(\epsilon_i) = 0 while E(\epsilon_i^2) = \sigma^2. I understand E(\epsilon_i^2) = \sigma^2 from statistics, but not the first part.

If that is not the case, then what is the reason for the last statement to be 0?

htlin 08-03-2015 03:22 PM

Re: Exercise 3.4
 
In the problem statement, I think "zero mean of the noise" is a given condition? :clueless:
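
(And there is no conflict between the zero mean and the \sigma^2 variance: with \mathbb{E}[\epsilon_i] = 0, we have Var(\epsilon_i) = \mathbb{E}[\epsilon_i^2] - (\mathbb{E}[\epsilon_i])^2 = \mathbb{E}[\epsilon_i^2] = \sigma^2.)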

zhout2 10-07-2016 07:13 PM

Re: Exercise 3.4
 
As \frac{1}{N}E(\epsilon^T H \epsilon)=\sigma ^2 (d+1), the answer is really not \sigma ^2 (1-\frac{d+1}{N}), is it?

zhout2 10-10-2016 02:53 PM

Re: Exercise 3.4
 
Quote:

Originally Posted by zhout2 (Post 12451)
As \frac{1}{N}E(\epsilon^T H \epsilon)=\sigma ^2 (d+1), the answer is really not \sigma ^2 (1-\frac{d+1}{N}), is it?

Never mind. It's just a typo in the original post. The answer is still correct.
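
For anyone who wants a sanity check, here is a quick simulation (a rough NumPy sketch; the dimensions, noise level, and trial count are arbitrary choices of mine) that estimates \mathbb{E}_D[E_{in}( \textbf{w}_{lin} )] numerically and compares it with \sigma^2 (1 - \frac{d+1}{N}):

Code:

import numpy as np

# Estimate E_D[E_in(w_lin)] by simulation and compare with sigma^2 * (1 - (d+1)/N).
rng = np.random.default_rng(0)
d, N, sigma, trials = 5, 50, 0.5, 10000
w_star = rng.standard_normal(d + 1)                                  # arbitrary target weights

total_Ein = 0.0
for _ in range(trials):
    X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])    # inputs with a bias coordinate
    eps = sigma * rng.standard_normal(N)                             # zero-mean noise, variance sigma^2
    y = X @ w_star + eps
    w_lin = np.linalg.solve(X.T @ X, X.T @ y)                        # least-squares weights
    residual = X @ w_lin - y
    total_Ein += residual @ residual / N                             # in-sample squared error

print("simulated  E_D[E_in] :", total_Ein / trials)
print("sigma^2 (1-(d+1)/N)  :", sigma**2 * (1 - (d + 1) / N))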

johnwang 10-13-2017 09:50 AM

Re: Exercise 3.4
 
I still don't understand why (eq 1) leads to zero. I know that \epsilon_i and \epsilon_j are zero-mean independent variables. However, the term H_{ij} \epsilon_i \epsilon_j depends on both \epsilon_i and \epsilon_j, so I don't know how to prove that the sum of the H_{ij} \epsilon_i \epsilon_j terms has an expected value of zero.

Quote:

Originally Posted by tomaci_necmi (Post 11678)
In my textbook, there is a statement on the topic of linear regression / machine learning, and a question, which I quote below:

Consider a noisy target, y = (w^{*})^T \textbf{x} + \epsilon, for generating the data, where \epsilon is a noise term with zero mean and \sigma^2 variance, independently generated for every example (\textbf{x},y). The expected error of the best possible linear fit to this target is thus \sigma^2.

For the data D =  \{ (\textbf{x}_1,y_1), ..., (\textbf{x}_N,y_N)  \}, denote the noise in y_n as \epsilon_n, and let \mathbf{\epsilon} = [\epsilon_1, \epsilon_2, ..., \epsilon_N]^T; assume that X^TX is invertible. By following the steps below, ***show that the expected in-sample error of linear regression with respect to D is given by***,

\mathbb{E}_D[E_{in}( \textbf{w}_{lin} )] = \sigma^2 (1 - \frac{d+1}{N})


Below is my methodology,


The book says that the in-sample error vector, \hat{\textbf{y}} - \textbf{y}, can be expressed as (H-I)\epsilon, where H = X(X^TX)^{-1}X^T is the hat matrix and \epsilon is the noise vector.

So, I calculated the in-sample error, E_{in}( \textbf{w}_{lin} ), as,

E_{in}( \textbf{w}_{lin} ) = \frac{1}{N}(\hat{\textbf{y}} - \textbf{y})^T (\hat{\textbf{y}} - \textbf{y}) =  \frac{1}{N}  (\epsilon^T (H-I)^T (H-I) \epsilon)

Since it is given by the book that,

(I-H)^K = (I-H) for any positive integer K, (I-H) is symmetric, and trace(H) = d+1

I got the following simplified expression,

E_{in}( \textbf{w}_{lin} ) =\frac{1}{N}  (\epsilon^T (H-I)^T (H-I) \epsilon) = \frac{1}{N} \epsilon^T (I-H) \epsilon = \frac{1}{N} \epsilon^T \epsilon - \frac{1}{N} \epsilon^T H \epsilon


Here, I see that,

\mathbb{E}_D[\frac{1}{N} \epsilon^T \epsilon] = \frac {N \sigma^2}{N} = \sigma^2

Also, the quadratic form - \frac{1}{N} \epsilon^T H \epsilon expands into the following sum,

- \frac{1}{N} \epsilon^T H \epsilon = - \frac{1}{N} \{ \sum_{i=1}^{N} H_{ii} \epsilon_i^2 + \sum_{i \neq j} H_{ij} \ \epsilon_i \ \epsilon_j \}

I understand that,

- \frac{1}{N} \mathbb{E}_D[\sum_{i=1}^{N} H_{ii} \epsilon_i^2] = - trace(H) \ \sigma^2 = - (d+1) \ \sigma^2


However, I don't understand why,

- \frac{1}{N} \mathbb{E}_D[\sum_{i \neq j} H_{ij} \ \epsilon_i \ \epsilon_j ] = 0 \ \ \ \ \ \ \ \ \ \ \ \ (eq \ 1)


(eq 1) should be equal to zero in order to satisfy the equation,



\mathbb{E}_D[E_{in}( \textbf{w}_{lin} )] = \sigma^2 (1 - \frac{d+1}{N})


***Can anyone explain why (eq 1) equals zero?***


johnwang 10-13-2017 09:55 AM

Re: Exercise 3.4
 
Is it because the noise is generated independently for each datapoint?
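
If so, I think the argument is: H_{ij} is a function of the inputs \textbf{x}_1, ..., \textbf{x}_N only, not of the noise, so conditioning on the inputs gives, for i \neq j, \mathbb{E}[ H_{ij} \ \epsilon_i \ \epsilon_j \mid X] = H_{ij} \ \mathbb{E}[\epsilon_i \mid X] \ \mathbb{E}[\epsilon_j \mid X] = 0, and taking the expectation over the inputs keeps it at zero.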

