Why am I getting different results with MALLET topic inference for a single document vs. a batch of documents?
I'm trying to perform LDA topic modeling with Mallet 2.0.7. I can train an LDA model and get good results, judging by the output from the training session. I can also use the inferencer built in that process and get similar results when re-processing my training file. However, if I take an individual file from the larger training set and process it with the inferencer, I get very different results, which are not good.
My understanding is that the inferencer should use a fixed model plus only features local to that document, so I don't understand why I would get different results when processing one file versus the 1,000 in my training set. I am not doing frequency cutoffs, which would seem to be a global operation that could have this kind of effect. You can see the other parameters I'm using in the commands below; they're mostly defaults. Changing the number of iterations to 0 or 100 didn't help.
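To illustrate what I mean by "fixed model": inference amounts to something like the toy Gibbs sampler below, where the topic-word distribution `phi` is frozen and only the document's own topic counts are resampled. This is a simplified sketch, not MALLET's actual code; `phi`, `alpha`, and the function name are made up for illustration.

```python
import random

def infer_doc_topics(doc_tokens, phi, alpha=0.1, iters=100, seed=0):
    """Toy sampler: phi[k][w] (topic-word probabilities) stays FIXED;
    only this document's local topic counts are resampled."""
    num_topics = len(phi)
    rng = random.Random(seed)
    assignments = [rng.randrange(num_topics) for _ in doc_tokens]
    doc_counts = [0] * num_topics
    for z in assignments:
        doc_counts[z] += 1
    for _ in range(iters):
        for i, w in enumerate(doc_tokens):
            doc_counts[assignments[i]] -= 1  # remove token's current topic
            # P(topic k) is proportional to the fixed phi times local doc counts
            weights = [phi[k][w] * (doc_counts[k] + alpha) for k in range(num_topics)]
            k = rng.choices(range(num_topics), weights=weights)[0]
            assignments[i] = k
            doc_counts[k] += 1
    total = len(doc_tokens) + num_topics * alpha
    return [(doc_counts[k] + alpha) / total for k in range(num_topics)]

# A document made entirely of word 0 should land on the topic that favors word 0.
phi = [[0.9, 0.1],   # topic 0 prefers word 0
       [0.1, 0.9]]   # topic 1 prefers word 1
print(infer_doc_topics([0] * 20, phi))
```

Since `phi` is fixed, the output depends only on the document itself, which is why I'd expect the same result whether I process one file or a thousand.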
Import data:
bin/mallet import-dir \
--input trainingDataDir \
--output train.data \
--remove-stopwords TRUE \
--keep-sequence TRUE \
--gram-sizes 1,2 \
--keep-sequence-bigrams TRUE
Train:
time ../bin/mallet train-topics \
--input ../train.data \
--inferencer-filename lda-inferencer-model.mallet \
--num-top-words 50 \
--num-topics 100 \
--num-threads 3 \
--num-iterations 100 \
--doc-topics-threshold 0.1 \
--output-topic-keys topic-keys.txt \
--output-doc-topics doc-topics.txt
Topics assigned during training to one file in particular; topic #14 is about wine, which is correct:
998 file:/.../29708933509685249 14 0.31684981684981683
> grep "^14\t" topic-keys.txt
14 0.5 wine spray cooking car climate top wines place live honey sticking ice prevent collection market hole climate_change winery tasting california moldova vegas horses converted paper key weather farmers_market farmers displayed wd freezing winter trouble mexico morning spring earth round mici torrey_pines barbara kinda nonstick grass slide tree exciting lots
Run inference on entire train batch:
../bin/mallet infer-topics \
--input ../train.data \
--inferencer lda-inferencer-model.mallet \
--output-doc-topics inf-train.1 \
--num-iterations 100
Inference score on train -- very similar:
998 /.../29708933509685249 14 0.37505087505087503
Run inference on another data file comprising only that one txt file:
../bin/mallet infer-topics \
--input ../one.data \
--inferencer lda-inferencer-model.mallet \
--output-doc-topics inf-one.2 \
--num-iterations 100
Inference on the one document produces topics 80 and 36, which are very different (topic 14 is given a near-zero score):
0 /.../29708933509685249 80 0.3184778184778185 36 0.19067969067969068
> grep "^80\t" topic-keys.txt
80 0.5 tips dog care pet safety items read policy safe offer pay avoid stay important privacy services ebay selling terms person meeting warning poster message agree sellers animals public agree_terms follow pets payment fraud made privacy_policy send description puppy emailed clicking safety_tips read_safety safe_read stay_safe services_stay payment_services transaction_payment offer_transaction classifieds_offer
The problem was incompatibility between the train.data and one.data files. Even though I had been careful to use all of the same options, two data files will by default use different Alphabets (the mapping between words and integers). To correct this, use the --use-pipe-from [MALLET TRAINING FILE] option when importing; specifying the other options then seems to be unnecessary. Thanks to David Mimno.
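To see why a fresh import breaks inference, here is a toy sketch of how first-seen-order alphabets diverge. These are plain Python dictionaries, not MALLET's Alphabet class, and the example words are made up:

```python
def build_alphabet(docs):
    """Assign integer ids to words in first-seen order, like a fresh import (toy sketch)."""
    alphabet = {}
    for doc in docs:
        for word in doc.split():
            # setdefault evaluates len(alphabet) before inserting,
            # so each new word gets the next consecutive id
            alphabet.setdefault(word, len(alphabet))
    return alphabet

train_alphabet = build_alphabet(["wine tasting winery", "dog pet care"])
one_alphabet = build_alphabet(["dog pet care"])  # re-imported alone, no --use-pipe-from

print(train_alphabet["dog"])  # 3: the id the trained model knows "dog" by
print(one_alphabet["dog"])    # 0: the id a fresh import gives it
```

The inferencer looks up its topic-word counts by integer id, so with the fresh alphabet, id 0 in the single-file data would be read as if it were "wine" — hence the wrong topics for the single document.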
bin/mallet import-dir \
--input [trainingDataDirWithOneFile] \
--output one.data \
--use-pipe-from train.data