
#1




When to use normalization?
Hi all and Prof. Yaser,
Machine learning practitioners often say that the input data should sometimes be normalized before an algorithm is trained on it. So, when should we normalize our input data? Put another way, do all machine learning algorithms require normalization? If not, which ones do? And finally, why is there a need for normalization? Thanks a bunch!
#2




Re: When to use normalization?
In general, nothing is lost by normalizing the data, and it can help various optimization algorithms.
You need to normalize the data for any algorithm that treats the inputs on an equal footing. For example, an algorithm that uses the Euclidean distance (such as the Support Vector Machine) treats all the inputs on the same footing.
You should not normalize the data if the scale of the data has significance. For example, if income is twice as important as debt in credit approval, then it is appropriate for income to have twice the scale of debt. Or, if you do normalize the inputs in this case, then you should take this difference in importance into account some other way.
One important precaution when normalizing the data: if you are using something like validation to estimate your test error, always compute the normalization parameters from the training data only, and use those same parameters to rescale the validation data. If you do not follow this strict prescription, then your validation estimate will not be legitimate.
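The precaution about validation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `fit_standardizer` and `transform`, the choice of mean/standard-deviation scaling, and the income/debt numbers are all hypothetical and not taken from the thread.

```python
import statistics

def fit_standardizer(rows):
    """Compute per-column mean and standard deviation from the given rows."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    # Fall back to 1.0 for a constant column so we never divide by zero.
    stds = [statistics.pstdev(c) or 1.0 for c in cols]
    return means, stds

def transform(rows, means, stds):
    """Apply previously fitted parameters to any rows (training or not)."""
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

# Hypothetical data: each row is [income, debt] in arbitrary units.
train = [[30.0, 2.0], [60.0, 4.0], [90.0, 6.0]]
val = [[120.0, 1.0]]

means, stds = fit_standardizer(train)   # parameters come from training data only
train_n = transform(train, means, stds)
val_n = transform(val, means, stds)     # reuse them; do not refit on validation
```

Note that `val_n` will generally not have zero mean or unit spread. That is expected: the validation point is simply expressed in the coordinate system the learning algorithm saw during training.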
__________________
Have faith in probability 
#3




Re: When to use normalization?

#4




Re: When to use normalization?
Curiosity, thanks for asking this question, and Prof. Magdon, for his reply. A while ago, similar thoughts crossed my mind too.
When we talk about normalization, are we talking about getting rid of the "units" of the data? For example, if the input vector has weight and height features, do we scale to effectively get rid of the kg and meter units, say by dividing by the average weight and average height respectively (or by some constant)? Is this what you mean by getting them on an equal footing?
You caution about "scaling" in a relative sense. Generally speaking, does this suggest that a cavalier application of simple normalization can distort the correlative structure implicit in the (original) input data?
Your use of the word scaling raises another question in my mind. Does it make sense to keep an eye on whether the features of the input data have disparate "ranges"? Say one feature ranges over [0,1] and another over [1,1000]. Does it make sense to reduce the "range" of the latter to make it comparable to the range of the other feature?
I tried to think through your comments in the context of supervised vs. unsupervised learning. In a regression situation, we have an LHS and an RHS, and I suppose one could be more cavalier about normalization, as long as it is done consistently across the system. However, for unsupervised learning, my immediate thought is that one needs to be a lot more careful about relative scaling between features. Roughly speaking, is my suspicion right?
Related to these questions is a nagging concern: does one unduly give insignificant features importance by bringing them onto an "equal footing"?
Thank you for your comments and thoughts.
__________________
The whole is simpler than the sum of its parts.  Gibbs 
#5




Re: When to use normalization?
When I said normalize, I meant place the data into some normal form, like having the same "scale".
Here is an example to help. Suppose you have three points: x = (1,2), (1,2), (3,2); y = +1, -1, +1.
One way to normalize the data is to make the average squared value of each coordinate equal to 1. You would divide the first x-coordinate by $\sqrt{11/3}$ and the second coordinate by $2$. Now both coordinates are "normalized" so that the average squared value is 1.
Suppose instead you wanted to use the third point as a test point. Now you normalize using only the first 2 points. In this case you don't change the first coordinate, and you divide the second coordinate by 2, to get the normalized data. You learn on this normalized training data of 2 points and test the learned hypothesis on the 3rd point. Before you test the learned hypothesis, you need to rescale the test point with the same rescaling parameters that you used to normalize the 2 training data points.
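The arithmetic in this three-point example can be checked with a short Python sketch. The function names `rms_scales` and `rescale` are illustrative choices, not anything from the thread, and the code uses the coordinate values as they appear in the post; since only squared values enter this normalization, the signs of the coordinates would not change the divisors.

```python
import math

def rms_scales(points):
    """Per-coordinate divisor that makes the average squared value equal 1."""
    n = len(points)
    dims = len(points[0])
    return [math.sqrt(sum(p[j] ** 2 for p in points) / n) for j in range(dims)]

def rescale(points, scales):
    """Divide each coordinate by its (previously computed) divisor."""
    return [tuple(v / s for v, s in zip(p, scales)) for p in points]

points = [(1, 2), (1, 2), (3, 2)]

# Case 1: normalize using all three points.
all_scales = rms_scales(points)          # [sqrt(11/3), 2.0]

# Case 2: hold out the third point as a test point.
train, test = points[:2], points[2:]
train_scales = rms_scales(train)         # [1.0, 2.0]
train_n = rescale(train, train_scales)   # [(1.0, 1.0), (1.0, 1.0)]
test_n = rescale(test, train_scales)     # [(3.0, 1.0)]
```

In the second case the test point's first coordinate stays at 3, because the divisor learned from the two training points is 1; it is not forced to satisfy the "average squared value equals 1" property.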
#6




Re: When to use normalization?
Thanks, Dr. Magdon-Ismail, for the example. However, I'm still not sure I understand exactly why we must apply the training data's rescaling parameters to the test data. I guess I could see that if we were doing a simple log transform (e.g. if you used base 10 on the training data you certainly wouldn't want to use base 2 on the test data), but in your example you are transforming the data to meet a certain criterion (average squared value of each coordinate = 1). Would your model then expect a new data set to have the same property? If we apply the exact rescaling parameters that we used on the training set to the test set, it certainly won't meet that criterion. Thanks for your help!

#7




Re: When to use normalization?
You are right, scaling can be any transformation. If you used some transformation to learn on the training data, you must use the same transformation when you test. Here is a simple idealized setting with your log transform and with simple scaling. Suppose the problem is 1-dimensional regression:
x: 2, 4, 6. y: 6, 12, 18. xtest = 8, ytest = 24.
It is easy to see the relationship is y = 3x. We can successfully learn this from the training data.
Now suppose we rescaled the x data by 0.5 in the training: x' = 1, 2, 3; y = 6, 12, 18. What is the relationship you would learn? $y = 6x'$. Now try to apply this to the test data: $6\,x_{\text{test}} = 48 \neq 24 = y_{\text{test}}$, because you did not rescale the test data in exactly the way you did the training data. If you also rescale the test datum, then xtest' becomes 4 and indeed the function you learned works: ytest = 6 xtest' = 24.
Let's see what happens with the log transform: the "rescaled", i.e. transformed, x data become x' = log 2, log 4, log 6; y = 6, 12, 18. What is the relationship you would learn? $y = 3e^{x'}$ (taking log to be the natural logarithm). If you simply apply this to the test point it will fail: $3e^{8} \neq 24$. You must first transform the test point to xtest' = log 8. Now it is indeed the case that your learned function will work: $3e^{\log 8} = 24 = y_{\text{test}}$.
The thing to realize is that when you rescale the training data and then learn, the learning takes into account the scaling, and the hypothesis learned will depend on what scaling is used, as the examples above illustrate. In other words, the hypothesis works for any data point (training or test) only after the scaling is applied.
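The 1-dimensional regression example can also be run numerically. The following Python sketch is an illustration under assumptions: the helper names `fit_slope` and `fit_line` are hypothetical, the scaling case is fitted as a least-squares line through the origin, and the log case assumes natural logarithms and a model of the form y = a * exp(b * x'), fitted as a straight line in log space.

```python
import math

def fit_slope(xs, ys):
    """Least-squares fit of y = w * x (line through the origin)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def fit_line(xs, ys):
    """Least-squares fit of y = c + b * x; returns (c, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    b = num / den
    return my - b * mx, b

x_train, y_train = [2.0, 4.0, 6.0], [6.0, 12.0, 18.0]
x_test, y_test = 8.0, 24.0

# Simple scaling: learn on x' = 0.5 * x.
scale = 0.5
w = fit_slope([scale * x for x in x_train], y_train)   # 6.0
wrong = w * x_test              # 48.0: the test point was not rescaled
right = w * (scale * x_test)    # 24.0: same rescaling as in training

# Log transform: learn on x' = ln(x), fitting ln(y) = ln(a) + b * x'.
log_a, b = fit_line([math.log(x) for x in x_train],
                    [math.log(y) for y in y_train])
wrong_log = math.exp(log_a + b * x_test)             # roughly 3 * e**8, far from 24
right_log = math.exp(log_a + b * math.log(x_test))   # approximately 24
```

In both cases the learned parameters only make sense in the transformed coordinates, so the prediction matches `y_test` only when the test input goes through the same transformation first.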

