Are there any classification algorithms that target data with a one-to-many (1:n) relationship?
Has there been any research in the field of data mining on classifying data that has a one-to-many relationship?
As an example of such a problem, say I am trying to predict which students will drop out of university based on their class grades and personal information. There is clearly a one-to-many relationship between a student's personal information and the grades they achieved in their classes.
Obvious approaches include:
Aggregate - The multiple records could be aggregated together in some way, reducing the problem to a basic classification problem. In the student example, the average of a student's grades could be combined with their personal data. While this solution is simple, key information is often lost. For example, suppose most students who get below a C- in organic chemistry end up dropping out, even when their overall average is above a B+.
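A minimal sketch of what such aggregation might look like (the field names and the choice of summary statistics are illustrative, not prescribed):

```python
# Sketch of the "aggregate" approach: collapse each student's many course
# records into one fixed-length feature vector. Keeping the minimum grade
# alongside the mean preserves a little of the signal the mean loses.

def aggregate_features(personal, courses):
    """personal: dict of student attributes; courses: list of (name, grade)
    pairs with grades already mapped to numbers (e.g. A = 4.0 ... F = 0.0)."""
    grades = [g for _, g in courses]
    return {
        "age": personal["age"],
        "mean_grade": sum(grades) / len(grades),
        "min_grade": min(grades),
        "num_courses": len(grades),
    }

features = aggregate_features(
    {"age": 23},
    [("organic_chemistry", 1.7), ("classical_physics", 3.7)],
)
```

The resulting fixed-length dict can be fed to any standard classifier, at the cost of discarding the per-course detail discussed above.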
Voting - Build multiple classifiers (often weak ones) and have them cast votes to determine the overall class of the data in question. For example, two classifiers could be built: one for the student's course data and one for their personal data. Each course record would be passed to the course classifier, which would predict from the grade and the course name alone whether the student will drop out. The personal data record would be classified by the personal data classifier. All of the per-course predictions, along with the personal-data prediction, would then be combined by a vote. The voting could be done in a number of different ways, but would most likely take into account how accurate each classifier is and how confident it was in its vote. This scheme allows for more complicated classification patterns than aggregation, but it adds a lot of complexity, and if the voting is done poorly, accuracy can easily suffer.
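A rough sketch of such a confidence-weighted vote (the per-record predictions and confidences here are invented stand-ins, not outputs of trained classifiers):

```python
# Combine one weak prediction per course record plus one from the
# personal-data classifier, weighting each vote by its confidence.

def weighted_vote(predictions):
    """predictions: list of (predicted_dropout: bool, confidence: float).
    Returns True if the confidence-weighted evidence favors dropout."""
    score = sum(conf if drop else -conf for drop, conf in predictions)
    return score > 0

votes = [
    (True, 0.8),   # course classifier, organic chemistry record
    (False, 0.3),  # course classifier, classical physics record
    (False, 0.4),  # personal-data classifier
]
result = weighted_vote(votes)  # 0.8 - 0.3 - 0.4 = 0.1 > 0: predict dropout
```

A real system would also scale each vote by the classifier's measured accuracy, as described above.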
So I am looking for other possible solutions to the classification of data with a one to many relationship.
Why wouldn't you treat each grade as a separate feature of the same model?
student['age'] = 23
student['gender'] = 'male'
...
student['grade_in_organic_chemistry'] = 'B+'
student['grade_in_classical_physics'] = 'A-'
I guess I'm not seeing why you would want to "aggregate" or join together multiple classifiers when the grades can just be distinct features?
(Please excuse the lame pseudocode above, I'm just trying to demonstrate my point.)
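One wrinkle with this is that students take different classes, so the feature set varies per student. A minimal sketch (hypothetical field names) of flattening variable course records into those distinct features, leaving untaken courses as None:

```python
def to_feature_vector(student, all_courses):
    """Flatten one student's variable-length course list into fixed columns.
    Courses the student never took are left as None."""
    row = {"age": student["age"], "gender": student["gender"]}
    grades = dict(student["courses"])
    for course in all_courses:
        row["grade_in_" + course] = grades.get(course)  # None if not taken
    return row

row = to_feature_vector(
    {"age": 23, "gender": "male",
     "courses": [("organic_chemistry", "B+"), ("classical_physics", "A-")]},
    ["organic_chemistry", "classical_physics", "linear_algebra"],
)
```

How the classifier should treat the None entries (a sentinel value, an indicator column, imputation) is then the real modeling decision.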
While this is probably sub-optimal compared to specialized methods, you could use an SVM with a correction for unbalanced classes, as in the following example (using the Python library scikit-learn):
http://scikit-learn.sourceforge.net/auto_examples/svm/plot_weighted_classes.html
In practice, I have had good results with fairly unbalanced classes.
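A minimal, self-contained sketch of that class-weight correction (synthetic one-dimensional data, not taken from the linked example; in scikit-learn the knob is the class_weight parameter of SVC):

```python
from sklearn.svm import SVC

# Heavily unbalanced toy data: only two examples of class 1 ("dropout").
X = [[0.0], [0.2], [0.4], [0.6], [0.8], [1.0], [1.2], [1.4], [1.6], [1.8]]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# Penalize misclassifying the rare class more heavily.
clf = SVC(kernel="linear", class_weight={1: 5})
clf.fit(X, y)
```

Passing class_weight="balanced" instead lets scikit-learn derive the weights from the class frequencies automatically.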
It's hard to say without knowing more, but from the Bayesian perspective, you may be interested in the case of missing features. I'll discuss in general terms. For more, see [Duda and Hart, 2nd ed., pp. 54-55].
For any classifier, the Bayes decision rule is to choose the class i that maximizes the probability of class i given the observed data x, i.e., max_i P(i|x). The vector x contains the features, e.g., a student's grades, age, etc.
Not all students take the same classes, so the feature vector x may have empty elements, i.e., "missing features". In that case, you marginalize over the missing features (sum the joint distribution over all values they could take) and then make the decision using the good, remaining features.
Example. Suppose a student took biology, but not chemistry:
P(student drops out | A+ in biology)
= P(student drops out, A+ in biology) / P(A+ in biology)

Marginalizing the unknown chemistry grade out of both the numerator and the denominator:

  P(drops out, A+ in biology, A in chemistry) + P(drops out, A+ in biology, B in chemistry) + ... + P(drops out, A+ in biology, F in chemistry)
= ---------------------------------------------------------------------------------------------------------------------------------------------
  P(A+ in biology, A in chemistry) + P(A+ in biology, B in chemistry) + ... + P(A+ in biology, F in chemistry)

where both sums run over every possible chemistry grade.
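As a toy numerical check (all joint probabilities invented), the posterior with the chemistry grade marginalized out is just a ratio of two sums over the joint:

```python
# joint[(chem_grade, drops_out)] = P(drops out = d, A+ in biology, grade g in chemistry)
# Invented numbers; they only need to sum to P(A+ in biology).
joint = {
    ("A", True): 0.01, ("A", False): 0.09,
    ("B", True): 0.02, ("B", False): 0.08,
    ("F", True): 0.05, ("F", False): 0.01,
}

num = sum(p for (g, d), p in joint.items() if d)  # P(drops out, A+ in biology)
den = sum(joint.values())                         # P(A+ in biology)
posterior = num / den  # P(drops out | A+ in biology), chemistry summed out
```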
I envision two basic paths forward:
As you call it, the "aggregate" solution, which would utilize various summaries of each student's situation: how many classes were taken, what percent of classes were introductory 101 classes, average grade, lowest quartile grade, etc.
Some type of evidence accumulator, such as a naive Bayes model (as already suggested by Steve) or a fuzzy logic rule base. Such solutions naturally handle varying amounts of incoming data. I suppose this could be achieved with enough data, using one giant conventional model (neural network, etc.) and a very large set of inputs (most of which would be set to a neutral value for "missing"), but I doubt it would work as well as other options.
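The second option can be sketched as a log-odds accumulator in the naive Bayes spirit (all likelihood ratios invented): each observed course grade contributes one independent additive term, so students with different numbers of courses are handled uniformly.

```python
import math

# Invented per-grade likelihood ratios P(grade | drops out) / P(grade | stays).
lr = {"A": 0.5, "B": 0.9, "C": 1.5, "D": 2.5, "F": 4.0}

def dropout_log_odds(prior_odds, grades):
    """Accumulate evidence: prior log-odds plus one log-likelihood-ratio
    term per observed course grade, however many there are."""
    log_odds = math.log(prior_odds)
    for g in grades:
        log_odds += math.log(lr[g])
    return log_odds
```

A student with weak grades accumulates more dropout evidence than one with strong grades, regardless of how many courses each took.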
Sorry, but I think the "gang of simple solutions" would be weak in this particular case. That's not to say that it wouldn't work, but I'd start somewhere else.