Are there any classification algorithms that target data with a one-to-many (1:n) relationship?
Has there been any research in the field of data mining on classifying data that has a one-to-many relationship?
As an example of such a problem, say I am trying to predict which students will drop out of university based on their class grades and personal information. There is clearly a one-to-many relationship between a student's personal information and the grades they achieved in their classes.
Obvious approaches include:
Aggregate - The multiple records could be aggregated together in some way, reducing the problem to a basic classification problem. In the student example, the average of a student's grades could be combined with their personal data. While this solution is simple, key information is often lost. For example, suppose most students who get below a C- in organic chemistry end up dropping out, even when their overall average is above a B+.
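A minimal sketch of what such aggregation might look like (the field names and the choice of summary statistics are illustrative, not prescribed):

```python
# Sketch of the "aggregate" approach: collapse each student's many course
# records into one fixed-length feature vector. Keeping the minimum grade
# alongside the mean preserves a little of the signal the mean loses.

def aggregate_features(personal, courses):
    """personal: dict of student attributes; courses: list of (name, grade)
    pairs with grades already mapped to numbers (e.g. A = 4.0 ... F = 0.0)."""
    grades = [g for _, g in courses]
    return {
        "age": personal["age"],
        "mean_grade": sum(grades) / len(grades),
        "min_grade": min(grades),
        "num_courses": len(grades),
    }

features = aggregate_features(
    {"age": 23},
    [("organic_chemistry", 1.7), ("classical_physics", 3.7)],
)
```

The resulting fixed-length dict can be fed to any standard classifier, at the cost of discarding the per-course detail discussed above.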
Voting - Build multiple classifiers (often weak ones) and have them cast votes to determine the overall class of the data in question. For example, two classifiers could be built: one for the student's course data and one for their personal data. Each course record would be passed to the course classifier, which would predict from the grade and the course name alone whether the student will drop out. The personal data record would be classified by the personal data classifier. All of the per-course predictions, along with the personal-data prediction, would then be combined by a vote. The voting could be done in a number of different ways, but would most likely take into account how accurate each classifier is and how confident it was in its vote. This scheme allows for more complicated classification patterns than aggregation, but it adds a lot of complexity, and if the voting is done poorly, accuracy can easily suffer.
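A rough sketch of such a confidence-weighted vote (the per-record predictions and confidences here are invented stand-ins, not outputs of trained classifiers):

```python
# Combine one weak prediction per course record plus one from the
# personal-data classifier, weighting each vote by its confidence.

def weighted_vote(predictions):
    """predictions: list of (predicted_dropout: bool, confidence: float).
    Returns True if the confidence-weighted evidence favors dropout."""
    score = sum(conf if drop else -conf for drop, conf in predictions)
    return score > 0

votes = [
    (True, 0.8),   # course classifier, organic chemistry record
    (False, 0.3),  # course classifier, classical physics record
    (False, 0.4),  # personal-data classifier
]
result = weighted_vote(votes)  # 0.8 - 0.3 - 0.4 = 0.1 > 0: predict dropout
```

A real system would also scale each vote by the classifier's measured accuracy, as described above.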
So I am looking for other possible solutions to the classification of data with a one to many relationship.
Why wouldn't you treat each grade as a separate feature of the same model?
student['age'] = 23
student['gender'] = 'male'
...
student['grade_in_organic_chemistry'] = 'B+'
student['grade_in_classical_physics'] = 'A-'
I guess I'm not seeing why you would want to "aggregate" or join together multiple classifiers when the grades can just be distinct features?
(Please excuse the lame pseudocode above, I'm just trying to demonstrate my point.)
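One wrinkle with this is that students take different classes, so the feature set varies per student. A minimal sketch (hypothetical field names) of flattening variable course records into those distinct features, leaving untaken courses as None:

```python
def to_feature_vector(student, all_courses):
    """Flatten one student's variable-length course list into fixed columns.
    Courses the student never took are left as None."""
    row = {"age": student["age"], "gender": student["gender"]}
    grades = dict(student["courses"])
    for course in all_courses:
        row["grade_in_" + course] = grades.get(course)  # None if not taken
    return row

row = to_feature_vector(
    {"age": 23, "gender": "male",
     "courses": [("organic_chemistry", "B+"), ("classical_physics", "A-")]},
    ["organic_chemistry", "classical_physics", "linear_algebra"],
)
```

How the classifier should treat the None entries (a sentinel value, an indicator column, imputation) is then the real modeling decision.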
While this is probably sub-optimal compared to specialized methods, you could use an SVM with a correction for unbalanced classes, as in the following example (using the Python library scikit-learn):
http://scikit-learn.sourceforge.net/auto_examples/svm/plot_weighted_classes.html
In practice, I have had good results with fairly unbalanced classes.
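A minimal, self-contained sketch of that class-weight correction (synthetic one-dimensional data, not taken from the linked example; in scikit-learn the knob is the class_weight parameter of SVC):

```python
from sklearn.svm import SVC

# Heavily unbalanced toy data: only two examples of class 1 ("dropout").
X = [[0.0], [0.2], [0.4], [0.6], [0.8], [1.0], [1.2], [1.4], [1.6], [1.8]]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# Penalize misclassifying the rare class more heavily.
clf = SVC(kernel="linear", class_weight={1: 5})
clf.fit(X, y)
```

Passing class_weight="balanced" instead lets scikit-learn derive the weights from the class frequencies automatically.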
It's hard to say without knowing more, but from the Bayesian perspective, you may be interested in the case of missing features. I'll discuss in general terms. For more, see [Duda and Hart, 2nd ed., pp. 54-55].
For any classifier, the Bayes decision rule is to choose the class i that maximizes the probability of class i given the observed data x, i.e., max_i P(i|x). The vector x contains the features, e.g., a student's grades, age, etc.
Not all students take the same classes, so the feature vector x may have empty elements, i.e., "missing features". In that case, you marginalize over the missing features (sum the joint distribution over all values they could take) and then make the decision using the good, remaining features.
Example. Suppose a student took biology, but not chemistry:
P(student drops out | A+ in biology)
= P(student drops out, A+ in biology) / P(A+ in biology)

Marginalizing the unknown chemistry grade out of both the numerator and the denominator:

  P(drops out, A+ in biology, A in chemistry) + P(drops out, A+ in biology, B in chemistry) + ... + P(drops out, A+ in biology, F in chemistry)
= ---------------------------------------------------------------------------------------------------------------------------------------------
  P(A+ in biology, A in chemistry) + P(A+ in biology, B in chemistry) + ... + P(A+ in biology, F in chemistry)

where both sums run over every possible chemistry grade.
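As a toy numerical check (all joint probabilities invented), the posterior with the chemistry grade marginalized out is just a ratio of two sums over the joint:

```python
# joint[(chem_grade, drops_out)] = P(drops out = d, A+ in biology, grade g in chemistry)
# Invented numbers; they only need to sum to P(A+ in biology).
joint = {
    ("A", True): 0.01, ("A", False): 0.09,
    ("B", True): 0.02, ("B", False): 0.08,
    ("F", True): 0.05, ("F", False): 0.01,
}

num = sum(p for (g, d), p in joint.items() if d)  # P(drops out, A+ in biology)
den = sum(joint.values())                         # P(A+ in biology)
posterior = num / den  # P(drops out | A+ in biology), chemistry summed out
```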
I envision two basic paths forward:
As you call it, the "aggregate" solution, which would utilize various summaries of each student's situation: how many classes were taken, what percent of classes were introductory 101 classes, average grade, lowest quartile grade, etc.
Some type of evidence accumulator, such as a naive Bayes model (as already suggested by Steve) or a fuzzy logic rule base. Such solutions naturally handle varying amounts of incoming data. I suppose this could be achieved with enough data, using one giant conventional model (neural network, etc.) and a very large set of inputs (most of which would be set to a neutral value for "missing"), but I doubt it would work as well as other options.
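The second option can be sketched as a log-odds accumulator in the naive Bayes spirit (all likelihood ratios invented): each observed course grade contributes one independent additive term, so students with different numbers of courses are handled uniformly.

```python
import math

# Invented per-grade likelihood ratios P(grade | drops out) / P(grade | stays).
lr = {"A": 0.5, "B": 0.9, "C": 1.5, "D": 2.5, "F": 4.0}

def dropout_log_odds(prior_odds, grades):
    """Accumulate evidence: prior log-odds plus one log-likelihood-ratio
    term per observed course grade, however many there are."""
    log_odds = math.log(prior_odds)
    for g in grades:
        log_odds += math.log(lr[g])
    return log_odds
```

A student with weak grades accumulates more dropout evidence than one with strong grades, regardless of how many courses each took.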
Sorry, but I think the "gang of simple solutions" would be weak in this particular case. That's not to say that it wouldn't work, but I'd start somewhere else.