machine learning and code generator from strings

2022-12-30 09:33 问答作者：

The problem: Given a set of hand categorized strings (or a set of ordered vectors of strings) generate a categorize function to categorize more input. In my case, that data (or most of it) is not natural language.

The question: are there any tools out there that will do that? I'm thinking of some kind of reasonably polished, download, install and go kind of things, as opposed to to some library or a brittle academic program.

(Please don't get stuck on details as the real details would restrict an开发者_运维知识库swers to less generally useful responses AND are under NDA.)

As an example of what I'm looking at; the input I'm wanting to filter is computer generated status strings pulled from logs. Error messages (as an example) being filtered based on who needs to be informed or what action needs to be taken.

Doing Things Manually

If the error messages are being generated automatically and the list of exceptions behind the messages is not terribly large, you might just want to have a table that directly maps each error message type to the people who need to be notified.

This should make it easy to keep track of exactly who/which-groups will be getting what types of messages and to update the routing of messages should you decide that some of the messages are being misdirected.

Typically, a small fraction of the types of errors make up a large fraction of error reports. For example, Microsoft noticed that 80% of crashes were caused by 20% of the bugs in their software. So, to get something useful, you wouldn't even need to start with a complete table covering every type of error message. Instead, you could start with just a list that maps the most common errors to the right person and routes everything else to a person for manual routing. Each time an error is routed manually, you could then add an entry to the routing table so that errors of that type are handled automatically in the future.

Document Classification

Unless the error messages are being editorialized by people who submit them and you want to use this information when routing them, I wouldn't recommend treating this as a document classification task. However, if this is what you want to do, here's a list of reasonably good packages for document document classification organized by programming language:

Python - To do this using the Python based Natural Language Toolkit (NLTK), see the Document Classification section in the freely available NLTK book.

Ruby - If Ruby is more of your thing, you can use the Classifier gem. Here's sample code that detects whether Family Guy quotes are funny or not-funny.

C# - C# programmers can use nBayes. The project's home page has sample code for a simple spam/not-spam classifier.

Java - Java folks have Classifier4J, Weka, Lucene Mahout, and as adi92 mentioned Mallet.

Learning Rules with Weka - If rules are what you want, Weka might be of particular interest, since it includes a rule set based learner. You'll find a tutorial on using Weka for text categorization here.

Mallet has a bunch of classifiers which you can train and deploy entirely from the commandline
Weka is nice too because it has a huge number of classifiers and preprocessors for you to play with

Have you tried spam or email filters? By using text files that have been marked with appropriate categories, you should be able to categorize further text input. That's what those programs do, anyway, but instead of labeling your outputs a 'spam' and 'not spam', you could do other categories.

You could also try something involving AdaBoost for a more hands-on approach to rolling your own. This library from Google looks promising, but probably doesn't meet your ready-to-deploy requirements.

继续阅读：classification code-generation decision-tree machine-learning

machine learning and code generator from strings

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？