A good machine learning technique to weed out good URLs from bad
I have an application that needs to discriminate between good HTTP GET requests and bad.
For example:
http://somesite.com?passes=dodgy+parameter # BAD
http://anothersite.com?passes=a+good+parameter # GOOD
My system can make a binary decision about whether a URL is good or bad, but ideally I would like it to predict whether a previously unseen URL is good or bad.
http://some-new-site.com?passes=a+really+dodgy+parameter # BAD
I feel the need for a support vector machine (SVM) ... but I need to learn machine learning. Some questions:
1) Is an SVM appropriate for this task?
2) Can I train it with the raw URLs, without explicitly specifying 'features'?
3) How many URLs will I need for it to be good at predictions?
4) What kind of SVM kernel should I use?
5) After I train it, how do I keep it up to date?
6) How do I test unseen URLs against the SVM to decide whether each is good or bad?
I think that steve and StompChicken both make excellent points:
- Picking the best algorithm is tricky, even for machine learning experts. Using a general-purpose package like Weka will let you easily compare a bunch of different approaches to determine which works best for your data.
- Choosing good features is often one of the most important factors in how well a learning algorithm will work.
It could also be useful to examine how other people have approached similar problems:
- Qi, X. and Davison, B. D. (2009). Web page classification: Features and algorithms. ACM Computing Surveys 41(2), 1–31.
- Kan, M.-Y. and Thi, H. O. N. (2005). Fast webpage classification using URL features. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM '05), New York, NY, pp. 325–326.
- Devi, M. I., Rajaram, R., and Selvakuberan, K. (2007). Machine learning techniques for automated web page classification using URL features. In Proceedings of the International Conference on Computational Intelligence and Multimedia Applications (ICCIMA 2007), Volume 2, Washington, DC, pp. 116–120.
I don't agree with steve that an SVM is a bad choice here, although I also don't think there's much reason to think it will do any better than any other discriminative learning algorithm.
You are going to need to at least think about designing features. This is one of the most important parts of getting a machine learning algorithm to work well on a particular problem. It's hard to know what to suggest without knowing more about the problem, but I guess you could start with counts of the character n-grams present in the URL as features; a sketch of this idea follows below.
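For illustration only, here is a minimal sketch of what character n-gram count features might look like. The library (scikit-learn), the n-gram range and the toy URLs are my assumptions, not something the question specifies:

```python
# Minimal sketch: turn raw URLs into character n-gram count features.
# scikit-learn and the 2-4 character n-gram range are assumptions for illustration.
from sklearn.feature_extraction.text import CountVectorizer

urls = [
    "http://somesite.com?passes=dodgy+parameter",      # labelled bad
    "http://anothersite.com?passes=a+good+parameter",  # labelled good
]
labels = [0, 1]  # 0 = bad, 1 = good; a real dataset would have thousands of these

# Each URL becomes a sparse vector of counts over character 2-, 3- and 4-grams.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 4))
X = vectorizer.fit_transform(urls)

print(X.shape)  # (number of URLs, number of distinct n-grams seen)
```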
Nobody really knows how much data you need for any specific problem. The general approach is to get some data, learn a model, see if more training data helps, repeat until you don't get any more significant improvement.
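As a rough sketch of that loop, assuming a feature matrix X and labels for a few thousand URLs (e.g. built as above); the split size, classifier and data fractions are arbitrary choices:

```python
# Sketch of "train, check whether more data helps, repeat".
# Assumes X and labels hold a few thousand vectorised URLs, not the toy pair above.
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X_train, X_val, y_train, y_val = train_test_split(X, labels, test_size=0.2, random_state=0)

for frac in (0.1, 0.25, 0.5, 1.0):
    n = max(2, int(frac * X_train.shape[0]))
    clf = LinearSVC().fit(X_train[:n], y_train[:n])
    print(f"{n:6d} training URLs -> validation accuracy {clf.score(X_val, y_val):.3f}")
# If accuracy is still improving at the full training set, more labelled URLs will likely help.
```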
Kernels are a tricky business. Some SVM libraries have string kernels, which let you train on strings without any feature extraction (I'm thinking of SVMsequel; there may be others). Otherwise, you need to compute numerical or binary features from your data and use the linear, polynomial or RBF kernel. There's no harm in trying them all, and it's worth spending some time finding the best settings for the kernel parameters. Your data is also obviously structured, and there's no point in letting the learning algorithm try to figure out the structure of URLs on its own (unless you care about invalid URLs). You should at least split the URL up according to the separators '/', '?', '.', '='; a sketch of this follows below.
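Purely as an illustration, splitting URLs on those separators and comparing kernels could look like the following; the library (scikit-learn), the exact separator set and the parameter grid are all assumptions on my part:

```python
# Sketch: tokenise URLs on their structural separators and compare SVM kernels.
# Assumes `urls` and `labels` cover a realistically sized labelled collection.
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def url_tokens(url):
    """Split a URL on '/', '?', '.', '=', '&' and '+' into its pieces."""
    return [token for token in re.split(r"[/?.=&+]", url) if token]

vectorizer = CountVectorizer(tokenizer=url_tokens, token_pattern=None)
X = vectorizer.fit_transform(urls)

# Try linear, polynomial and RBF kernels with a small grid of parameter settings.
grid = GridSearchCV(
    SVC(),
    {"kernel": ["linear", "poly", "rbf"], "C": [0.1, 1, 10], "gamma": ["scale", "auto"]},
    cv=3,
)
grid.fit(X, labels)
print(grid.best_params_, grid.best_score_)
```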
I don't know what you mean by 'keep it up to date'. Retrain the model with whatever new data you have.
This depends on the library you use. In SVMlight there is a program called svm_classify that takes a model and an example and gives you a class label (good or bad). It should be straightforward to do in any library.
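For comparison, in a scikit-learn-based workflow (my assumption; the question doesn't name a library), classifying an unseen URL with an already-trained model looks roughly like this, reusing the fitted vectoriser and model from the kernel sketch above:

```python
# Sketch: classify a previously unseen URL with the model trained earlier.
# `vectorizer` and `grid` come from the kernel-comparison sketch above.
new_url = "http://some-new-site.com?passes=a+really+dodgy+parameter"

x_new = vectorizer.transform([new_url])  # reuse the fitted vectoriser; never refit on test data
prediction = grid.predict(x_new)[0]
print("GOOD" if prediction == 1 else "BAD")
```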
If I understand correctly, you just want to learn whether a URL is good or bad.
An SVM is not appropriate. SVMs are only appropriate if the dataset is very complex and many of the data points lie close to the separating hyperplane; you'd use an SVM kernel to effectively add extra dimensions to the data.
Ideally you'd want a few thousand URLs to train on. The more the better; you could do it with just 100, but the results may not produce good classifications.
I'd suggest you build your dataset first and use Weka (http://www.cs.waikato.ac.nz/ml/weka/). You can then measure which algorithm gives you the best results.
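To sketch that comparison step in code (using scikit-learn rather than Weka itself, only because the other sketches in this thread use it; the candidate algorithms are arbitrary):

```python
# Sketch: try several algorithms on the same URL features and compare them.
# X and labels are assumed to be a vectorised, labelled URL collection as above.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

candidates = {
    "linear SVM": LinearSVC(),
    "naive Bayes": MultinomialNB(),
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-NN": KNeighborsClassifier(),
}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, labels, cv=5)
    print(f"{name:20s} mean cross-validated accuracy {scores.mean():.3f}")
```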
What dataset will you be using for training? If you have a good dataset, I believe an SVM will do well with a well-chosen penalty factor. If there is no dataset yet, I would suggest using online algorithms like k-NN or even perceptrons.
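If you do go the online route, a minimal sketch of incremental updates with scikit-learn's Perceptron might look like this (the feature matrix, labels and new example are assumptions carried over from the earlier sketches):

```python
# Sketch: keep the model up to date by feeding it new labelled URLs as they arrive,
# instead of retraining from scratch. X, labels and vectorizer are assumed from above.
from sklearn.linear_model import Perceptron

online_clf = Perceptron()
online_clf.partial_fit(X, labels, classes=[0, 1])  # classes must be given on the first call

# Later, when a newly labelled URL arrives, update the same model in place.
new_X = vectorizer.transform(["http://some-new-site.com?passes=a+really+dodgy+parameter"])
online_clf.partial_fit(new_X, [0])  # 0 = bad, per the labelling convention above
```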