LIBSVM scaling vs. PyML normalization, scaling, and translation
What is the proper way to normalize feature vectors for use in a linear-kernel SVM?
Looking at LIBSVM, it looks like it's done by just rescaling each feature to a single standard upper/lower range. However, it doesn't seem like PyML provides a way to scale the data this way. Instead, there are options to normalize the vectors by their length, shift each feature value by its mean while rescaling by the standard deviation, etc.
I am dealing with a case when most features are binary, except a few that are numeric.
I am not an expert in this, but I believe a typical way to normalize features for use with SVMs is to center and scale each feature (each column of the data, not each example): subtract its mean, then divide by its standard deviation. In R, this can be done with the scale function.
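For readers working in Python rather than R, here is a minimal sketch of the same centering and scaling using only the standard library. The function name standardize_columns and the toy data are made up for illustration. The sample standard deviation (n - 1 denominator) is used because that is what R's scale does by default.

```python
from statistics import mean, stdev

def standardize_columns(rows):
    """Center each feature (column) to mean 0 and scale it to unit sample std."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # stdev() uses the n - 1 denominator, like R's scale() by default.
    stds = [stdev(col) for col in columns]
    scaled = [
        # A constant feature has std 0; map it to 0.0 instead of dividing by zero.
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]
    return scaled, means, stds

# Hypothetical data: column 0 is binary, column 1 is numeric.
rows = [[1.0, 10.0], [0.0, 20.0], [1.0, 60.0]]
scaled, means, stds = standardize_columns(rows)
```

The returned means and stds should be kept and reused to transform any test data, so that training and test examples end up on the same scale.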
Another way is to rescale each feature to the [0,1] range:
(x - min(x)) / (max(x) - min(x))
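A minimal sketch of that formula in plain Python, generalized to an arbitrary [lower, upper] range, which is the kind of per-feature rescaling the question describes for LIBSVM. The helper names fit_min_max and apply_min_max are hypothetical, not part of LIBSVM or PyML.

```python
def fit_min_max(rows):
    """Record the per-feature minimum and maximum of the training data."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def apply_min_max(rows, mins, maxs, lower=0.0, upper=1.0):
    """Rescale each feature linearly so that [min, max] maps to [lower, upper]."""
    scaled = []
    for row in rows:
        new_row = []
        for x, lo, hi in zip(row, mins, maxs):
            if hi == lo:
                new_row.append(lower)  # constant feature: nothing to rescale
            else:
                new_row.append(lower + (upper - lower) * (x - lo) / (hi - lo))
        scaled.append(new_row)
    return scaled

# Hypothetical data: column 0 is binary, column 1 is numeric.
train = [[0.0, 10.0], [1.0, 20.0], [1.0, 60.0]]
mins, maxs = fit_min_max(train)
train_scaled = apply_min_max(train, mins, maxs)
# Test data is scaled with the minima and maxima learned from the training data.
test_scaled = apply_min_max([[0.0, 35.0]], mins, maxs)
```

With the [0,1] target range, a binary 0/1 feature that takes both values in the training data comes out unchanged, so only the numeric features are actually affected.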
Maybe some features could benefit from a log transformation if their distribution is very skewed, but this would change the shape of the distribution as well, not only "move" it.
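As a small illustration of how a log transformation changes the shape, here is a sketch on made-up values with one large outlier. It assumes the feature is non-negative and uses log(1 + x) so that zeros are handled.

```python
import math

# Hypothetical skewed feature: most values are small, one is very large.
values = [1.0, 2.0, 3.0, 5.0, 1000.0]

# log1p(x) computes log(1 + x), which is defined at x = 0.
logged = [math.log1p(v) for v in values]

# Ratio of the largest to the smallest value, before and after the transform.
spread_before = max(values) / min(values)
spread_after = max(logged) / min(logged)
```

The order of the values is preserved, but the gap between the outlier and the rest shrinks a lot, which is exactly the change of shape that a linear rescaling cannot produce.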
I am not sure what you gain in an SVM-setting by normalizing the vectors by their L1 or L2 norm like PyML does with its normalize method. I guess binary features (0 or 1) don't need to be normalized.
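To make the length normalization concrete, here is a minimal sketch of dividing each example by its L2 norm. The function l2_normalize is a hypothetical stand-in written for illustration, not PyML's actual implementation.

```python
import math

def l2_normalize(row):
    """Scale one example (row) to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in row))
    # An all-zero example has no direction; return it unchanged.
    return [x / norm for x in row] if norm > 0 else list(row)

# Two hypothetical binary examples with 2 and 4 active features.
a = l2_normalize([1.0, 1.0, 0.0, 0.0])
b = l2_normalize([1.0, 1.0, 1.0, 1.0])

# With a linear kernel, the dot product of unit-length vectors
# is the cosine of the angle between the original vectors.
cosine = sum(x * y for x, y in zip(a, b))
```

Note that this works per example rather than per feature: an active binary feature is no longer 1 afterwards, and its value depends on how many other features are active in the same example.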