Supervised learning with multiple sources of training data

I'm not sure this is the right Stack Exchange site for machine learning questions, but I've seen ML questions here before, so I'm trying my luck (also posted at http://math.stackexchange.com).

I have training instances that come from different sources, so building a single model doesn't work well. Is there a known method for handling such cases?

An example explains it best. Let's say I want to classify cancer/non-cancer given training data constructed from different populations. Training instances from one population might have a completely different distribution of positive/negative examples than those from another population. I could build a separate model for each population, but the problem is that at test time I don't know which population a test instance comes from.

*all training/testing instances have the exact same feature set regardless of the population they came from.


I suspect this might not work any better than just throwing all your data into a single classifier trained on the whole set. At a high level, the features of the data set should tell you the labels, not the input distribution. But you could try it.

Train a separate classifier for each dataset that tries to predict the label. Then train a classifier on the combined data that tries to predict which dataset a data point came from. When you want to predict the label for a test instance, run each sub-classifier and give its prediction a weight proportional to the probability assigned by the high-level dataset classifier (see the sketch below).

This feels a lot like the expectation step in a Gaussian mixture model, where the probability of generating a data point is a probability-weighted average of the estimates from the K components.
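A minimal sketch of that weighting scheme, assuming scikit-learn; the two-population data and all variable names below are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-population training data (same feature set everywhere).
rng = np.random.default_rng(0)
X_pop1, y_pop1 = rng.normal(0, 1, (100, 5)), rng.integers(0, 2, 100)
X_pop2, y_pop2 = rng.normal(1, 1, (100, 5)), rng.integers(0, 2, 100)

# 1. One label classifier ("expert") per population.
experts = [
    LogisticRegression(max_iter=1000).fit(X_pop1, y_pop1),
    LogisticRegression(max_iter=1000).fit(X_pop2, y_pop2),
]

# 2. A gating classifier trained on the combined data to predict the source
#    population; its classes_ order (0, 1) matches the experts list.
X_all = np.vstack([X_pop1, X_pop2])
source = np.array([0] * len(X_pop1) + [1] * len(X_pop2))
gate = LogisticRegression(max_iter=1000).fit(X_all, source)

def predict_proba(X_test):
    """Weight each expert's P(label=1 | x) by P(population | x) from the gate."""
    weights = gate.predict_proba(X_test)               # shape (n, n_populations)
    expert_probs = np.column_stack(
        [clf.predict_proba(X_test)[:, 1] for clf in experts]
    )                                                   # shape (n, n_populations)
    return (weights * expert_probs).sum(axis=1)         # weighted average

X_test = rng.normal(0.5, 1, (10, 5))
print(predict_proba(X_test))
```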


A classical approach for this is hierarchical modeling (if you can define hierarchies), fixed-effects models (or random effects, depending on the assumptions and circumstances), or various other grouped or structural models.

You can do the same in a machine learning context by describing the distributions as a function of the source, both in terms of the sample populations and the response variable(s). Thus, source is essentially a feature that could potentially interact with all (or most) of the other features.
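A minimal sketch of the source-as-feature idea with scikit-learn; the data and the interaction construction below are hypothetical, just to show one way of letting each source get its own effective coefficients:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 5 numeric features plus a source identifier per instance.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
source = rng.integers(0, 2, 200)           # 0 = population A, 1 = population B
y = rng.integers(0, 2, 200)

# One-hot encode the source and build explicit source x feature interaction
# terms, so each population can get its own effective coefficients.
onehot = np.eye(2)[source]                                        # (200, 2)
interactions = np.einsum('ij,ik->ijk', onehot, X).reshape(len(X), -1)  # (200, 10)
X_aug = np.hstack([X, onehot, interactions])

model = LogisticRegression(max_iter=1000).fit(X_aug, y)
```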

The bigger question is whether your future (test) data will come from one of these sampling populations or yet another population.

Update 1: If you want to focus on machine learning rather than statistics, another related concept to look into is transfer learning. It's not terribly complicated, though it is rather hyped. The basic idea is that you find common properties in the auxiliary data distributions that can be mapped into the predictor/response framework of the target data source. In another sense, you're looking for a way to exclude source-dependent variation. These are very high-level descriptions, but they should help guide your reading.
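As one concrete and very simple flavor of this, covariate-shift correction via importance weighting uses a domain classifier to re-weight the auxiliary (source) sample so it looks like the target distribution; the sketch below assumes scikit-learn and hypothetical data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
# Hypothetical labelled source data and unlabelled target data.
X_src, y_src = rng.normal(0, 1, (200, 5)), rng.integers(0, 2, 200)
X_tgt = rng.normal(0.5, 1, (200, 5))

# Domain classifier: distinguish source (0) from target (1) instances.
domain = LogisticRegression(max_iter=1000).fit(
    np.vstack([X_src, X_tgt]),
    np.array([0] * len(X_src) + [1] * len(X_tgt)),
)

# Importance weights ~ P(target | x) / P(source | x) make the source sample
# resemble the target distribution.
p = domain.predict_proba(X_src)
weights = p[:, 1] / p[:, 0]

# Fit the label model on source data, re-weighted toward the target distribution.
model = LogisticRegression(max_iter=1000).fit(X_src, y_src, sample_weight=weights)
```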


If you are only interested in prediction (which I assume, since you are talking about supervised learning), then there is nothing wrong with mixing the datasets and training a joint model.

If you are using models like SVMs, neural networks, or logistic regression, it might help to add another feature indicating which population a sample belongs to. Once you get an unseen sample, set this feature to a neutral value (e.g. use -1 for population 1, +1 for population 2, and 0 for unseen samples).

You can then very easily inspect what difference the two populations make.
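A minimal sketch of that indicator-feature trick, assuming scikit-learn and hypothetical data; the last coefficient then shows the population effect directly:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X1, y1 = rng.normal(0, 1, (100, 5)), rng.integers(0, 2, 100)   # population 1
X2, y2 = rng.normal(1, 1, (100, 5)), rng.integers(0, 2, 100)   # population 2

# Append the population indicator as an extra feature: -1 for pop 1, +1 for pop 2.
X_train = np.vstack([
    np.hstack([X1, np.full((len(X1), 1), -1.0)]),
    np.hstack([X2, np.full((len(X2), 1), +1.0)]),
])
y_train = np.concatenate([y1, y2])

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The last coefficient shows how much the population indicator shifts the prediction.
print("population effect:", model.coef_[0][-1])

# For an unseen sample of unknown origin, set the indicator to the neutral value 0.
x_new = np.hstack([rng.normal(0.5, 1, (1, 5)), np.zeros((1, 1))])
print(model.predict_proba(x_new))
```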


A naive idea would be: if you have the same features for the training and test sets, you can construct a separate classifier for each population. You can then feed your test set to the ensemble and check whether the classifier matching the target population of a test instance performs better while all the other classifiers perform worse (or you can learn some kind of difference).

Can you build a separate classifier to predict which population an instance belongs to? If so, you can use it as a pre-classification step and then apply the corresponding per-population model.
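A minimal sketch of that two-stage routing, assuming scikit-learn and hypothetical data; a pre-classifier predicts the population and the matching per-population model then predicts the label:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X1, y1 = rng.normal(0, 1, (100, 5)), rng.integers(0, 2, 100)   # population 1
X2, y2 = rng.normal(2, 1, (100, 5)), rng.integers(0, 2, 100)   # population 2

# Stage 1: pre-classifier that predicts which population an instance belongs to.
X_all = np.vstack([X1, X2])
pop = np.array([0] * len(X1) + [1] * len(X2))
router = RandomForestClassifier(random_state=0).fit(X_all, pop)

# Stage 2: one classifier per population, trained only on that population's data.
experts = [
    RandomForestClassifier(random_state=0).fit(X1, y1),
    RandomForestClassifier(random_state=0).fit(X2, y2),
]

def predict(X_test):
    """Route each test instance to the expert of its predicted population."""
    routes = router.predict(X_test)
    return np.array([experts[r].predict(x.reshape(1, -1))[0]
                     for r, x in zip(routes, X_test)])

print(predict(rng.normal(1, 1, (5, 5))))
```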
