06-03-2016, 07:29 AM
Nick Torenvliet
Q5 Least Squares Behaviour

Regarding Q5:

I've written a Python script, with some matplotlib, to visualize and compare the various f and g in the 1000-run simulation.

In terms of process:
1- choose a population N of 100 random points (x1, x2), where x1 and x2 each lie between -1 and +1
2- solve for the slope f_m and intercept f_b of a line joining another two similarly chosen random points
3- classify the points in N as +1 or -1 by comparing x2 with f_m*x1 + f_b, to get a vector of classifications f_y
4- perform a linear least squares regression with numpy.linalg.lstsq and get g_m and g_b
5- classify the points in N as +1 or -1 by comparing x2 with g_m*x1 + g_b, to get a vector of classifications g_y
6- compare f_y and g_y to get E_in
7- repeat steps 1-6 1000 times to get the average E_in
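
For reference, steps 1-7 can be sketched roughly as below. This is a minimal sketch, not the actual script. The function names (random_line, classify, one_run, average_e_in) are made up for illustration. It assumes the regression in step 4 fits weights w on rows [1, x1, x2] against the +1/-1 labels. In step 5 it classifies with the sign of w.x directly instead of going through g_m and g_b, because dividing by w[2] drops the orientation of the line when w[2] is negative.

```python
import numpy as np

def random_line(rng):
    """Slope and intercept of the line through two random points in [-1, 1]^2."""
    (xa, ya), (xb, yb) = rng.uniform(-1, 1, size=(2, 2))
    m = (yb - ya) / (xb - xa)
    return m, ya - m * xa

def classify(points, m, b):
    """+1 where x2 lies above the line m*x1 + b, otherwise -1."""
    return np.where(points[:, 1] > m * points[:, 0] + b, 1.0, -1.0)

def one_run(rng, n=100):
    points = rng.uniform(-1, 1, size=(n, 2))       # step 1
    f_m, f_b = random_line(rng)                    # step 2
    f_y = classify(points, f_m, f_b)               # step 3
    X = np.column_stack([np.ones(n), points])      # rows [1, x1, x2] (assumed setup)
    w, *_ = np.linalg.lstsq(X, f_y, rcond=None)    # step 4
    g_y = np.where(X @ w > 0, 1.0, -1.0)           # step 5, via sign(w . x)
    return np.mean(f_y != g_y)                     # step 6: fraction of disagreements

def average_e_in(runs=1000, seed=0):
    rng = np.random.default_rng(seed)
    return float(np.mean([one_run(rng) for _ in range(runs)]))  # step 7
```

Calling average_e_in() runs the full 1000-run loop. The seed argument is only there to make runs repeatable.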

I am finding that when f cuts N such that there are very many points of one class and very few of the other, g will often misclassify all of the smaller set in favor of properly classifying all of the larger set.

Sometimes g will lie completely outside the viewing window, which is bounded by -2 and +2 on every side.

That g might misclassify all of the smaller set in these imbalanced cases, I can accept... I think. That g would lie very far away from the box bounded by -1 and +1 on every side troubles me. Am I right to think something is wrong here?
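
One way to probe this is a small deterministic case, sketched below. Everything in it is an assumption made for illustration: the 10x10 grid of points, the target line x2 = x1 + 1.5 (which cuts off only three points in one corner), the function name imbalanced_demo, and the same [1, x1, x2] regression setup with g_m = -w[1]/w[2] and g_b = -w[0]/w[2]. It is not data from the simulation.

```python
import numpy as np

def imbalanced_demo():
    # Hypothetical data: a regular 10x10 grid inside the [-1, 1] box.
    ticks = np.linspace(-0.9, 0.9, 10)
    x1, x2 = np.meshgrid(ticks, ticks)
    points = np.column_stack([x1.ravel(), x2.ravel()])

    # Hypothetical target f: x2 = x1 + 1.5, so only a corner is labelled +1.
    f_y = np.where(points[:, 1] > points[:, 0] + 1.5, 1.0, -1.0)

    # Least squares fit of w on rows [1, x1, x2] against the +1/-1 labels.
    X = np.column_stack([np.ones(len(points)), points])
    w, *_ = np.linalg.lstsq(X, f_y, rcond=None)

    # Boundary w0 + w1*x1 + w2*x2 = 0 rewritten as x2 = g_m*x1 + g_b.
    g_m, g_b = -w[1] / w[2], -w[0] / w[2]
    g_y = np.where(X @ w > 0, 1.0, -1.0)

    n_plus = int(np.sum(f_y > 0))            # size of the smaller class
    e_in = float(np.mean(f_y != g_y))        # fraction misclassified by g
    return n_plus, g_m, g_b, e_in
```

On this grid the three +1 points are all misclassified and g_b comes out near 6.2, well outside the box, even though the fit itself is an ordinary least squares solution. The bias weight w[0] is pulled close to the majority label, which pushes the zero crossing of w.x away from the data.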

The error is large enough to lead to the wrong answer for question 5, but only by a hair.

I did a fair amount of debugging, and I cannot see anything wrong other than the sometimes large gap between f_m/f_b and the g_m/g_b that the linear solver spits out when there is a large class imbalance.