Supervised learning with multiple sources of training data

I'm not sure this is the right Stack Exchange site for machine learning questions, but I've seen ML questions here before, so I'm trying my luck (also posted at http://math.stackexchange.com).

I have training instances that come from different sources, so building a single model doesn't work well. Is there a known method for handling such cases?

An example explains it best. Let's say I want to classify cancer/non-cancer given training data constructed from different populations. Training instances from one population might have a completely different distribution of positive/negative examples than those from another population. I could build a separate model for each population, but the problem is that at test time I don't know which population a test instance comes from.

*all training/testing instances have the exact same feature set regardless of the population they came from.


I suspect this might not work any better than just throwing all your data into a single classifier trained on the whole set. At a high level, the features of the data set should tell you the labels, not the input distribution. But you could try it.

Train a separate classifier for each dataset that tries to predict the label. Then train a classifier on the combined data that tries to predict which dataset a data point came from. When you want to predict the label for a test instance, run each sub-classifier and give its prediction a weight proportional to the probability assigned by the high-level dataset classifier (see the sketch below).

This feels a lot like the expectation step in a Gaussian mixture model, where the probability of generating a data point is a probability-weighted average of the estimates from the K components.
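A minimal sketch of that weighting scheme, assuming scikit-learn; the two-population data and all variable names below are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-population training data (same feature set everywhere).
rng = np.random.default_rng(0)
X_pop1, y_pop1 = rng.normal(0, 1, (100, 5)), rng.integers(0, 2, 100)
X_pop2, y_pop2 = rng.normal(1, 1, (100, 5)), rng.integers(0, 2, 100)

# 1. One label classifier ("expert") per population.
experts = [
    LogisticRegression(max_iter=1000).fit(X_pop1, y_pop1),
    LogisticRegression(max_iter=1000).fit(X_pop2, y_pop2),
]

# 2. A gating classifier trained on the combined data to predict the source
#    population; its classes_ order (0, 1) matches the experts list.
X_all = np.vstack([X_pop1, X_pop2])
source = np.array([0] * len(X_pop1) + [1] * len(X_pop2))
gate = LogisticRegression(max_iter=1000).fit(X_all, source)

def predict_proba(X_test):
    """Weight each expert's P(label=1 | x) by P(population | x) from the gate."""
    weights = gate.predict_proba(X_test)               # shape (n, n_populations)
    expert_probs = np.column_stack(
        [clf.predict_proba(X_test)[:, 1] for clf in experts]
    )                                                   # shape (n, n_populations)
    return (weights * expert_probs).sum(axis=1)         # weighted average

X_test = rng.normal(0.5, 1, (10, 5))
print(predict_proba(X_test))
```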


A classical approach for this is hierarchical modeling (if you can define hierarchies), fixed-effects models (or random effects, depending on the assumptions and circumstances), or various other grouped or structural models.

You can do the same in a machine learning context by describing the distributions as a function of the source, both in terms of the sample populations and the response variable(s). Thus, source is essentially a feature that could potentially interact with all (or most) of the other features.
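A minimal sketch of the source-as-feature idea with scikit-learn; the data and the interaction construction below are hypothetical, just to show one way of letting each source get its own effective coefficients:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 5 numeric features plus a source identifier per instance.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
source = rng.integers(0, 2, 200)           # 0 = population A, 1 = population B
y = rng.integers(0, 2, 200)

# One-hot encode the source and build explicit source x feature interaction
# terms, so each population can get its own effective coefficients.
onehot = np.eye(2)[source]                                        # (200, 2)
interactions = np.einsum('ij,ik->ijk', onehot, X).reshape(len(X), -1)  # (200, 10)
X_aug = np.hstack([X, onehot, interactions])

model = LogisticRegression(max_iter=1000).fit(X_aug, y)
```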

The bigger question is whether your future (test) data will come from one of these sampling populations or yet another population.

Update 1: If you want to focus on machine learning rather than statistics, another related concept to look into is transfer learning. It's not terribly complicated, though it is rather hyped. The basic idea is that you find common properties in the auxiliary data distributions that can be mapped into the predictor/response framework of the target data source. In another sense, you're looking for a way to exclude source-dependent variation. These are very high-level descriptions, but they should help guide your reading.
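As one concrete and very simple flavor of this, covariate-shift correction via importance weighting uses a domain classifier to re-weight the auxiliary (source) sample so it looks like the target distribution; the sketch below assumes scikit-learn and hypothetical data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
# Hypothetical labelled source data and unlabelled target data.
X_src, y_src = rng.normal(0, 1, (200, 5)), rng.integers(0, 2, 200)
X_tgt = rng.normal(0.5, 1, (200, 5))

# Domain classifier: distinguish source (0) from target (1) instances.
domain = LogisticRegression(max_iter=1000).fit(
    np.vstack([X_src, X_tgt]),
    np.array([0] * len(X_src) + [1] * len(X_tgt)),
)

# Importance weights ~ P(target | x) / P(source | x) make the source sample
# resemble the target distribution.
p = domain.predict_proba(X_src)
weights = p[:, 1] / p[:, 0]

# Fit the label model on source data, re-weighted toward the target distribution.
model = LogisticRegression(max_iter=1000).fit(X_src, y_src, sample_weight=weights)
```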


If you are only interested in prediction (which I assume, since you are talking about supervised learning), then there is nothing wrong with mixing the datasets and training a joint model.

If you are using models like SVMs, neural networks, or logistic regression, it might help to add another feature indicating which population a sample belongs to. Once you get an unseen sample, set this feature to a neutral value (e.g. use -1 for population 1, +1 for population 2, and 0 for unseen samples).

You can then very easily inspect what difference the two populations make.
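A minimal sketch of that indicator-feature trick, assuming scikit-learn and hypothetical data; the last coefficient then shows the population effect directly:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X1, y1 = rng.normal(0, 1, (100, 5)), rng.integers(0, 2, 100)   # population 1
X2, y2 = rng.normal(1, 1, (100, 5)), rng.integers(0, 2, 100)   # population 2

# Append the population indicator as an extra feature: -1 for pop 1, +1 for pop 2.
X_train = np.vstack([
    np.hstack([X1, np.full((len(X1), 1), -1.0)]),
    np.hstack([X2, np.full((len(X2), 1), +1.0)]),
])
y_train = np.concatenate([y1, y2])

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The last coefficient shows how much the population indicator shifts the prediction.
print("population effect:", model.coef_[0][-1])

# For an unseen sample of unknown origin, set the indicator to the neutral value 0.
x_new = np.hstack([rng.normal(0.5, 1, (1, 5)), np.zeros((1, 1))])
print(model.predict_proba(x_new))
```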


A naive idea would be: if you have the same features for the training and test sets, you can construct a separate classifier for each population. You can then feed your test set to the ensemble and check whether the classifier matching the target population of a test instance performs better while all the other classifiers perform worse (or you can learn some kind of difference).

Can you build a separate classifier to predict which population an instance belongs to? If so, you can use it as a pre-classification step and then apply the corresponding per-population model.
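A minimal sketch of that two-stage routing, assuming scikit-learn and hypothetical data; a pre-classifier predicts the population and the matching per-population model then predicts the label:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X1, y1 = rng.normal(0, 1, (100, 5)), rng.integers(0, 2, 100)   # population 1
X2, y2 = rng.normal(2, 1, (100, 5)), rng.integers(0, 2, 100)   # population 2

# Stage 1: pre-classifier that predicts which population an instance belongs to.
X_all = np.vstack([X1, X2])
pop = np.array([0] * len(X1) + [1] * len(X2))
router = RandomForestClassifier(random_state=0).fit(X_all, pop)

# Stage 2: one classifier per population, trained only on that population's data.
experts = [
    RandomForestClassifier(random_state=0).fit(X1, y1),
    RandomForestClassifier(random_state=0).fit(X2, y2),
]

def predict(X_test):
    """Route each test instance to the expert of its predicted population."""
    routes = router.predict(X_test)
    return np.array([experts[r].predict(x.reshape(1, -1))[0]
                     for r, x in zip(routes, X_test)])

print(predict(rng.normal(1, 1, (5, 5))))
```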
