On the 'Linear Model I' lecture: one-step learning is not what linear regression deserves
Hi, I hope you are enjoying your life. (Do you ever wonder why you should be the one here on Earth instead of some other person?!)
As the topic title shows, I want to spend some time discussing this lecture.
OK, how should I put it? Well, one-step learning is not what linear regression deserves. I strongly believe that linear regression has the same learning importance and beauty as the PLA, and even more!
I will advance my approach, which is adapted from linear algebra, in detail.
First, in the simplest case, consider three distinct points in the xy-plane. Our job is to find the best line approximating the (nonexistent) line that passes through all of them.
Now, I say that our first sparks of learning appear in the process of finding that best line. For instance, one tests some line that passes through two of the points, one point, or none of them, and realizes that to obtain the best line, one must come up with an error function that takes the error contributions of all points into account. So we arrive at the least-squares error function, which elegantly handles the errors of all points and, more importantly, has a linear derivative.
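To make this concrete, here is a minimal Python/numpy sketch of the idea; the three points and the two candidate lines are made-up examples of my own, not from the lecture:

    import numpy as np

    # Three made-up points that do not all lie on one line.
    xs = np.array([0.0, 1.0, 2.0])
    ys = np.array([1.0, 2.0, 2.5])

    def squared_error(c, d):
        # Sum of squared residuals for the candidate line y = c + d*x.
        residuals = ys - (c + d * xs)
        return float(np.sum(residuals ** 2))

    print(squared_error(1.0, 1.0))    # line through the first two points: error 0.25
    print(squared_error(1.08, 0.75))  # near the least-squares fit: error ~ 0.042

A line that nails two of the points is still beaten by a compromise line, because the error function charges every point for its miss.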
On the other hand, linear algebra says that the right-hand-side vector, namely b, is not in the column space of A, where A is the 3-by-2 matrix of the system Ax = b (each row of A is (1, x_i), and b collects the y_i values).
As predicted, three distinct points do not fit on the same line. Linear algebra suggests that instead of solving for b exactly, we can approximate x by introducing the error vector e = b - Ax^ and projecting b onto the column space, which is the point closest to b and minimizes the error vector e. One important learning step is that the error vector is orthogonal to the column space; this brilliant observation leads to the projection point, via the normal equations A'A x^ = A'b (where A' denotes the transpose of A), and consequently to the solution x^ that approximates the parameters of the best line.
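A short sketch of this projection, using the same made-up points as above (numpy assumed; x_hat plays the role of x^):

    import numpy as np

    # Rows of A are (1, x_i); b collects the y_i values.
    A = np.array([[1.0, 0.0],
                  [1.0, 1.0],
                  [1.0, 2.0]])
    b = np.array([1.0, 2.0, 2.5])

    # Solve the normal equations A'A x^ = A'b.
    x_hat = np.linalg.solve(A.T @ A, A.T @ b)
    p = A @ x_hat   # projection of b onto the column space of A
    e = b - p       # error vector

    print(x_hat)    # ~ [1.083, 0.75]: intercept and slope of the best line
    print(A.T @ e)  # ~ [0, 0]: e is orthogonal to every column of A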
In addition, I want to discuss a beautiful learning approach for finding the optimal line in linear regression: the gradient descent algorithm, which uses ideas from calculus to optimize the parameters of the line. This approach simply picks a random point at first and finds the steepest descent direction in which to take its little baby steps, until it finally reaches a local minimum. Actually, when we intuitively concluded that e is orthogonal to the column space, we used this optimization idea in one step, by taking the partial derivatives of the error function and setting them to zero.
But what I believe is this: before the derivative shortcut, we should learn the gradient descent algorithm.
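Here is a minimal gradient descent sketch for the same squared error; the starting point, learning rate eta, and iteration count are my own arbitrary choices:

    import numpy as np

    A = np.array([[1.0, 0.0],
                  [1.0, 1.0],
                  [1.0, 2.0]])
    b = np.array([1.0, 2.0, 2.5])

    w = np.zeros(2)  # starting point (here simply the origin)
    eta = 0.05       # learning rate: the size of the "little baby steps"

    for _ in range(2000):
        grad = 2 * A.T @ (A @ w - b)  # gradient of ||Aw - b||^2
        w -= eta * grad               # step in the steepest-descent direction

    print(w)  # ~ [1.083, 0.75], the same x^ found in one step above

Notice how it crawls toward the same answer that the projection delivers in a single step.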
At the end of my discussion, I want to talk about the properties of the magical matrix A'A. Firstly, it is symmetric and, more importantly, square. One magical property, which the Professor also mentioned in his lecture, is that if A is a full-rank matrix, i.e., its columns are independent, then A'A is invertible! And since our data points are chosen at random, A'A is, for all practical purposes, invertible.
Here is the proof:
Suppose A'Ax = 0.
Multiplying both sides on the left by x', we get x'A'Ax = (Ax)'(Ax) = ||Ax||^2 = 0, which forces Ax = 0. Now notice the important fact that Ax is not zero for any x except the x = 0 vector, because the columns of A are independent. So the null space of A'A is trivial, and A'A is invertible.
QED.
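And a tiny numerical sanity check of the two facts the proof rests on, using illustrative matrices of my own:

    import numpy as np

    # Independent columns: A'A is invertible, and x'A'Ax = ||Ax||^2.
    A = np.array([[1.0, 0.0],
                  [1.0, 1.0],
                  [1.0, 2.0]])
    x = np.array([3.0, -2.0])

    lhs = x @ (A.T @ A) @ x
    rhs = np.linalg.norm(A @ x) ** 2
    print(np.isclose(lhs, rhs))            # True: x'A'Ax equals ||Ax||^2
    print(np.linalg.matrix_rank(A.T @ A))  # 2: full rank, so A'A is invertible

    # Dependent columns: A'A turns singular.
    B = np.array([[1.0, 2.0],
                  [1.0, 2.0],
                  [1.0, 2.0]])
    print(np.linalg.matrix_rank(B.T @ B))  # 1: rank-deficient, not invertible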
I sincerely want to know your thoughts about this.
Best regards.
