LFD Book Forum  

#1  06-03-2016, 07:29 AM
Nick Torenvliet (Junior Member)

Q5 Least Squares Behaviour

Regarding Homework 2, Q5:

I've written a Python script with some matplotlib to visualize and compare the various f and g across the 1000-run simulation.

In terms of process (sketched in code after this list)...
1- choose a population N of 100 random points (x1, x2) with x1 and x2 each in (-1, +1)
2- solve for the slope f_m and intercept f_b of a line joining another two similarly chosen random points
3- classify points in N as +1 or -1 by comparing x2 against f_m*x1 + f_b to get a vector of classifications f_y
4- perform a linear least squares regression with numpy.linalg.lstsq and get g_m and g_b
5- classify points in N as +1 or -1 by comparing x2 against g_m*x1 + g_b to get a vector of classifications g_y
6- compare f_y and g_y to get E_in
7- repeat steps 1-6 1000 times to get the average E_in
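
For concreteness, here's a minimal sketch of what the script does. The helper names (target_line, run_trial) are just for this sketch, and I'm assuming the lstsq fit regresses the +/-1 labels on the features (1, x1, x2) and then converts the weights back to slope/intercept form, since that's one natural way to get a g_m and g_b out of it:

[CODE]
import numpy as np

rng = np.random.default_rng(0)

def target_line(rng):
    # Step 2: a line through two random points in [-1, +1]^2.
    (x1a, x2a), (x1b, x2b) = rng.uniform(-1, 1, (2, 2))
    f_m = (x2b - x2a) / (x1b - x1a)
    return f_m, x2a - f_m * x1a

def run_trial(rng, N=100):
    X = rng.uniform(-1, 1, (N, 2))                    # step 1: N random points
    f_m, f_b = target_line(rng)                       # step 2
    f_y = np.sign(X[:, 1] - (f_m * X[:, 0] + f_b))    # step 3: labels from f
    A = np.column_stack([np.ones(N), X])              # features (1, x1, x2)
    w, *_ = np.linalg.lstsq(A, f_y, rcond=None)       # step 4: least squares fit
    g_m, g_b = -w[1] / w[2], -w[0] / w[2]             # boundary as slope/intercept
    g_y = np.sign(X[:, 1] - (g_m * X[:, 0] + g_b))    # step 5: note this comparison
                                                      # fixes an orientation for g
    return np.mean(f_y != g_y)                        # step 6: E_in

print(np.mean([run_trial(rng) for _ in range(1000)]))  # step 7: average E_in
[/CODE]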

I am finding that when f cuts N such that there are very many of one class and very few of the other, g will often misclassify all of the smaller set in favor of properly classifying all of the larger set.

Sometimes g will lie completely outside the viewing window bounded by ±2 all around.

That g might misclassify all of the smaller set in these imbalanced cases, I can accept... I think. That g would lie very far away from the box bounded by ±1 all around troubles me. Am I right to think something is wrong here?

The error is large enough to lead to the wrong answer for question 5, but only by a hair.

I did a fair amount of debugging, and I cannot see anything wrong other than the sometimes large variance between f_m/f_b and g_m/g_b that the linear solver spits out when there is a large class imbalance.
#2  06-13-2016, 04:52 PM
Nick Torenvliet (Junior Member)

Re: Q5 Least Squares Behaviour

So... haha... systematic error.

Just as another student (sandeep) was, I was getting an average E_in of ~0.13 over 1000 trials of N = 100.

I believe this is indicative of not accounting for the case where the slope of the linear regression solution g has the opposite sign to that of the target function f.

In a naive approach to classification, you will get 100% error in that case. The case seems to occur predictably enough to bias what would otherwise be the correct answer to ~0.13.
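
Concretely, in the sketch from my first post, the step-5 comparison sign(x2 - (g_m*x1 + g_b)) works out to sign(w2) * sign(w . (1, x1, x2)), so whenever the fitted w2 is negative every label flips. Classifying with the sign of the fit directly keeps the orientation with the weights; a drop-in replacement for step 5 of the sketch above:

[CODE]
# Sign-safe classification: sign(A @ w) carries the orientation of g,
# so an opposite-sign solution can no longer flip every label.
g_y = np.sign(A @ w)  # replaces np.sign(X[:, 1] - (g_m * X[:, 0] + g_b))
[/CODE]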

Oddly enough... I am very satisfied with that... at least it confirms the law of large numbers.

Everything does though...