Why am I getting different results with MALLET topic inference for a single document vs. a batch of documents?
I'm trying to perform LDA topic modeling with Mallet 2.0.7. I can train an LDA model and get good results, judging by the output from the training session. I can also use the inferencer built in that process and get similar results when re-processing my training file. However, if I take an individual file from the larger training set and process it with the inferencer, I get very different results, which are not good.
My understanding is that the inferencer should use a fixed model plus only features local to that document, so I don't understand why I would get different results when processing one file versus the 1,000 in my training set. I am not doing frequency cutoffs, which would seem to be a global operation that could have this kind of effect. You can see the other parameters I'm using in the commands below; they're mostly defaults. Changing the number of iterations to 0 or 100 didn't help.
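To illustrate what I mean by "fixed model": inference amounts to something like the toy Gibbs sampler below, where the topic-word distribution `phi` is frozen and only the document's own topic counts are resampled. This is a simplified sketch, not MALLET's actual code; `phi`, `alpha`, and the function name are made up for illustration.

```python
import random

def infer_doc_topics(doc_tokens, phi, alpha=0.1, iters=100, seed=0):
    """Toy sampler: phi[k][w] (topic-word probabilities) stays FIXED;
    only this document's local topic counts are resampled."""
    num_topics = len(phi)
    rng = random.Random(seed)
    assignments = [rng.randrange(num_topics) for _ in doc_tokens]
    doc_counts = [0] * num_topics
    for z in assignments:
        doc_counts[z] += 1
    for _ in range(iters):
        for i, w in enumerate(doc_tokens):
            doc_counts[assignments[i]] -= 1  # remove token's current topic
            # P(topic k) is proportional to the fixed phi times local doc counts
            weights = [phi[k][w] * (doc_counts[k] + alpha) for k in range(num_topics)]
            k = rng.choices(range(num_topics), weights=weights)[0]
            assignments[i] = k
            doc_counts[k] += 1
    total = len(doc_tokens) + num_topics * alpha
    return [(doc_counts[k] + alpha) / total for k in range(num_topics)]

# A document made entirely of word 0 should land on the topic that favors word 0.
phi = [[0.9, 0.1],   # topic 0 prefers word 0
       [0.1, 0.9]]   # topic 1 prefers word 1
print(infer_doc_topics([0] * 20, phi))
```

Since `phi` is fixed, the output depends only on the document itself, which is why I'd expect the same result whether I process one file or a thousand.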
Import data:
bin/mallet import-dir \
--input trainingDataDir \
--output train.data \
--remove-stopwords TRUE \
--keep-sequence TRUE \
--gram-sizes 1,2 \
--keep-sequence-bigrams TRUE
Train:
time ../bin/mallet train-topics \
--input ../train.data \
--inferencer-filename lda-inferencer-model.mallet \
--num-top-words 50 \
--num-topics 100 \
--num-threads 3 \
--num-iterations 100 \
--doc-topics-threshold 0.1 \
--output-topic-keys topic-keys.txt \
--output-doc-topics doc-topics.txt
Topics assigned during training to one file in particular; topic #14 is about wine, which is correct:
998 file:/.../29708933509685249 14 0.31684981684981683
> grep "^14\t" topic-keys.txt
14 0.5 wine spray cooking car climate top wines place live honey sticking ice prevent collection market hole climate_change winery tasting california moldova vegas horses converted paper key weather farmers_market farmers displayed wd freezing winter trouble mexico morning spring earth round mici torrey_pines barbara kinda nonstick grass slide tree exciting lots
Run inference on entire train batch:
../bin/mallet infer-topics \
--input ../train.data \
--inferencer lda-inferencer-model.mallet \
--output-doc-topics inf-train.1 \
--num-iterations 100
Inference score on train -- very similar:
998 /.../29708933509685249 14 0.37505087505087503
Run inference on another data file comprising only that one txt file:
../bin/mallet infer-topics \
--input ../one.data \
--inferencer lda-inferencer-model.mallet \
--output-doc-topics inf-one.2 \
--num-iterations 100
Inference on the one document produces topics 80 and 36, which are very different (topic 14 is given a near-zero score):
0 /.../29708933509685249 80 0.3184778184778185 36 0.19067969067969068
> grep "^80\t" topic-keys.txt
80 0.5 tips dog care pet safety items read policy safe offer pay avoid stay important privacy services ebay selling terms person meeting warning poster message agree sellers animals public agree_terms follow pets payment fraud made privacy_policy send description puppy emailed clicking safety_tips read_safety safe_read stay_safe services_stay payment_services transaction_payment offer_transaction classifieds_offer
The problem was incompatibility between the train.data and one.data files. Even though I had been careful to use all of the same options, two data files will by default use different Alphabets (the mapping between words and integers). To correct this, use the --use-pipe-from [MALLET TRAINING FILE] option when importing; specifying the other options then seems to be unnecessary. Thanks to David Mimno.
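To see why a fresh import breaks inference, here is a toy sketch of how first-seen-order alphabets diverge. These are plain Python dictionaries, not MALLET's Alphabet class, and the example words are made up:

```python
def build_alphabet(docs):
    """Assign integer ids to words in first-seen order, like a fresh import (toy sketch)."""
    alphabet = {}
    for doc in docs:
        for word in doc.split():
            # setdefault evaluates len(alphabet) before inserting,
            # so each new word gets the next consecutive id
            alphabet.setdefault(word, len(alphabet))
    return alphabet

train_alphabet = build_alphabet(["wine tasting winery", "dog pet care"])
one_alphabet = build_alphabet(["dog pet care"])  # re-imported alone, no --use-pipe-from

print(train_alphabet["dog"])  # 3: the id the trained model knows "dog" by
print(one_alphabet["dog"])    # 0: the id a fresh import gives it
```

The inferencer looks up its topic-word counts by integer id, so with the fresh alphabet, id 0 in the single-file data would be read as if it were "wine" — hence the wrong topics for the single document.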
bin/mallet import-dir \
--input [trainingDataDirWithOneFile] \
--output one.data \
--use-pipe-from train.data