
Interactive Decision Tree Classifier

Can anyone recommend a decision tree classifier implementation, in either Python or Java, that can be used incrementally?

All the implementations I've found require you to provide all the features to the classifier at once in order to get a classification. However, in my application, I have hundreds of features, and some of the features are functions that can take a long time to evaluate. Since not all branches of the tree may use all the features, it doesn't make sense to give the classifier all the features at once. I'd like the classifier to ask for features one at a time, in the order that produces the maximum reduction in entropy, and then provide a final classification.
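Roughly, I want an interaction like the following sketch, where the tree pulls feature values only when it actually needs them. (The node layout and the feature functions here are hypothetical, just to illustrate the interface I'm after.)

    class Node:
        def __init__(self, feature=None, children=None, label=None):
            self.feature = feature          # feature this node asks for (None at a leaf)
            self.children = children or {}  # feature value -> child Node
            self.label = label              # classification at a leaf

    def classify(root, feature_funcs):
        """feature_funcs maps feature name -> zero-arg callable, possibly expensive."""
        node = root
        while node.label is None:
            value = feature_funcs[node.feature]()  # evaluated only when the tree asks
            node = node.children[value]
        return node.label

    def expensive_scan():
        print("running expensive scan...")  # never reached on this path
        return False

    tree = Node("fever", {
        True:  Node(label="flu"),
        False: Node("scan", {True: Node(label="infection"), False: Node(label="healthy")}),
    })
    print(classify(tree, {"fever": lambda: True, "scan": expensive_scan}))  # -> flu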


I believe there is no such implementation, but decision trees are so simple to implement that you shouldn't have any problem writing such a program on your own.
On the other hand, I don't think computing features on the fly can increase speed: even if some feature was used to make an earlier split, it still must be considered at the remaining splits, so for many records it would be recalculated many times (it may save memory, though). This could make sense in the case of a random forest, where only a random, limited subset of features is considered at each split -- but an RF is usable only as a classifier; it won't build you a nice, human-interpretable decision tree.
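If you do roll your own, the core computation is small. Here is a minimal sketch of entropy and information gain for choosing the next split on categorical features (toy data; not any particular library's API):

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def information_gain(rows, labels, feature_index):
        """Expected reduction in entropy from splitting on one categorical feature."""
        by_value = {}
        for row, label in zip(rows, labels):
            by_value.setdefault(row[feature_index], []).append(label)
        remainder = sum(len(subset) / len(labels) * entropy(subset)
                        for subset in by_value.values())
        return entropy(labels) - remainder

    # Toy data: two yes/no features; pick the split with the highest gain.
    rows   = [("yes", "yes"), ("yes", "no"), ("no", "yes"), ("no", "no")]
    labels = ["sick", "sick", "well", "well"]
    best = max(range(len(rows[0])), key=lambda i: information_gain(rows, labels, i))
    print(best)  # 0: the first feature separates the classes perfectly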


Usually such packages (J48 trees in Weka, in particular) allow you to specify a missing value in place of a feature value, which will be handled the same way the C4.5 algorithm would handle it:

when we reach the node that splits on the attribute with the missing value, we send the instance down each possible branch, weighted proportionally to the number of training instances going down those branches, eventually accumulating the result at the leaf nodes.
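A sketch of that fractional-instance scheme, assuming a hypothetical node layout (feature, children, train_counts, class_counts), might look like this:

    from dataclasses import dataclass

    @dataclass
    class Node:
        feature: str = None        # attribute this node splits on (None at a leaf)
        children: dict = None      # attribute value -> child Node
        train_counts: dict = None  # attribute value -> # training instances down that branch
        class_counts: dict = None  # leaf only: class -> # training instances

    def classify_with_missing(node, instance, weight=1.0, totals=None):
        totals = {} if totals is None else totals
        if node.children is None:                      # leaf: accumulate weighted votes
            n = sum(node.class_counts.values())
            for cls, count in node.class_counts.items():
                totals[cls] = totals.get(cls, 0.0) + weight * count / n
            return totals
        value = instance.get(node.feature)             # None means "missing"
        if value is not None:
            return classify_with_missing(node.children[value], instance, weight, totals)
        n = sum(node.train_counts.values())
        for v, child in node.children.items():         # split the weight across branches
            classify_with_missing(child, instance, weight * node.train_counts[v] / n, totals)
        return totals

    leaf_a = Node(class_counts={"cold": 8, "flu": 2})
    leaf_b = Node(class_counts={"flu": 5})
    root = Node(feature="fever", children={"yes": leaf_a, "no": leaf_b},
                train_counts={"yes": 10, "no": 5})
    print(classify_with_missing(root, {}))  # fever missing: weight split 10/15 vs 5/15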

Of course, you could take a more aggressive approach and change the way the tree chooses the attributes to split on during the training phase. A naive way would be to assign weights to attributes and multiply the splitting criterion (entropy, information gain, ...) by that weight as a penalty coefficient, so that "expensive attributes" would be less likely to be chosen as a splitting node.
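One naive form of that weighting, scaling each attribute's gain by a penalty (here, dividing by an assumed per-attribute cost; the cost table and gains are made up for illustration):

    def penalized_gain(gain, cost, alpha=1.0):
        """Higher cost -> lower score; alpha tunes how harsh the penalty is."""
        return gain / (cost ** alpha)

    # Hypothetical per-attribute costs and information gains:
    costs = {"head_ache": 1.0, "blood_test": 5.0, "mri_scan": 50.0}
    gains = {"head_ache": 0.05, "blood_test": 0.40, "mri_scan": 0.45}
    best = max(gains, key=lambda f: penalized_gain(gains[f], costs[f]))
    print(best)  # blood_test: the MRI's slightly higher gain doesn't justify its cost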


One option might be gaenari, a C++ incremental decision tree algorithm.

It is designed to minimize the damage from concept drift.

It is written in C++, but can be extended to Python and Java.


Are you concerned about this at training time or at classification time? Since those periods are separate, you can play tricks to avoid evaluating features until very late if it's the latter. There are no tricks you can play during training time: you have to provide all the features at the time of training. However, since training can happen outside your program, you don't have to be as concerned about calculation time there. Training the tree is the most intensive part.

So I'd suggest getting all your data together, training the tree, and then using lazy evaluation in your objects as you send them down the trained tree. Have your objects implement some interface for getting values, so your code can evaluate things lazily. If an object never hits a node requiring an expensive value, then you never have to evaluate it.

Your expensive calculations might not even be chosen as splits, in which case you eliminate the need to evaluate them at classification time entirely. You'll probably only have 3-5 features that are statistically relevant once you train and prune your tree. Then you can optimize just those features, with caching and such, so classification is fast.
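One way to get that lazy-evaluation-plus-caching behavior in Python is functools.cached_property: the feature is computed only if some node actually asks for it, and at most once per object. (The Patient class and blood_panel feature here are made-up examples.)

    from functools import cached_property

    class Patient:
        def __init__(self, record_id):
            self.record_id = record_id

        @cached_property
        def blood_panel(self):                 # hypothetical expensive feature
            print("running blood panel...")    # happens only if a node asks for it
            return 42.0

    p = Patient(1)
    print(p.blood_panel)  # first access: computes and caches
    print(p.blood_panel)  # second access: cached, no recomputation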

If you want incremental training, that's a whole other ball of wax, and there are algorithms for that. But they aren't explained very well, and you'll have to dig into research papers to get them.


So here is what I'd do. Given the answer to my previous question, I think you have something like the following. It sounds like you want to implement some sort of twenty-questions-like approach.

With twenty questions you have yes/no answers, so a binary tree works best. However, you could layer in multiple-choice options where the user picks one choice. This algorithm assumes you've trained your tree ahead of time and it's been built from a dataset you wish to use.

Say for example we're trying to do a medical diagnosis so our data might look like the following:

Disease Name  Head Ache   Fever  Back Pain  Leg Pain  Blurry Vision  Hearing Loss
Common Cold   Yes         Yes    No         No        No             No
Migraine      Yes         No     No         No        Yes            No
Herpes        No          Yes    No         No        No             No

In this example, Head Ache, Fever, Back Pain, Leg Pain, etc. are the influencers, and Disease Name is the target. Each row would be an actual diagnosis of a single patient, so a disease could be repeated in the data more than once.

  1. Modify a walk algorithm to start at the root.
  2. If you've reached a leaf, tell the user the potential answers.
  3. Take the influencer used to split this node and present it to the user as a yes/no question ("Do you have a head ache?").
  4. Go left if the user answers yes.
  5. Go right if the user answers no.
  6. Go to step 2.

In the leaf nodes you'll have the actual rows that made it to that location, so you can display them to the user, saying you might have one of these:

Head ache, Migraine, Severed Head

Prescription is: blah blah blah.
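Putting those steps together, a minimal interactive walk might look like the sketch below, with the tree built from the toy table above. (The node layout is a hypothetical stand-in for whatever your training code produces.)

    class Node:
        def __init__(self, question=None, yes=None, no=None, rows=None):
            self.question = question   # influencer asked at this node
            self.yes, self.no = yes, no
            self.rows = rows           # leaf: training rows that ended up here

    def walk(node):
        while node.rows is None:                       # step 2: stop at a leaf
            answer = input(node.question + " (y/n) ")  # step 3: ask the influencer
            node = node.yes if answer.lower().startswith("y") else node.no  # steps 4-5
        print("You might have one of:", ", ".join(r["disease"] for r in node.rows))

    tree = Node("Do you have a head ache?",
                yes=Node("Do you have blurry vision?",
                         yes=Node(rows=[{"disease": "Migraine"}]),
                         no=Node(rows=[{"disease": "Common Cold"}])),
                no=Node(rows=[{"disease": "Herpes"}]))
    walk(tree)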

With 1 million influencers it will take a while to build the tree. If you wanted to lower that, it might be possible to use multi-valued influencers instead of yes/no ones, although it's really hard to think of 1 million unique yes/no questions even covering every medical condition. Once you build the tree, it can offer as many diagnoses as you want.


The Python decision tree package at https://pypi.python.org/pypi/DecisionTree has an interactive mode for using a decision tree. It is not incremental in the sense you need, but it may be possible to easily change the code in its function for interactive operation so that you can see the results incrementally.
