
How to classify text when predefined categories are not available

I have a problem and can't work out which algorithm to apply. I am thinking of applying clustering in case two, but I have no idea about case one:

I have 0.5 million credit card activity documents. Each document is well defined and contains one transaction per line: the date, the amount, the retailer name, and a short 5-20 word description of the retailer.

Sample: 2004-11-47,$500,Amazon,An online retailer providing goods and services including books, hardware, music, etc.

Questions:
1. How would you classify each entry given no predefined categories?
2. How would you do this if you were given predefined categories such as "restaurant", "entertainment", etc.?


1) How would you classify each entry given no predefined categories?

You wouldn't. Without categories there is nothing to classify into, so this is a clustering problem. You'd use some dimensionality reduction algorithm on the data's features to plot them in 2-d, make a guess at the number of "natural" clusters, then run a clustering algorithm.
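To make the clustering step concrete, here is a minimal sketch using only the Python standard library. It assumes the retailer descriptions are the features, turns them into TF-IDF vectors, and groups them with a cosine-similarity k-means. The sample descriptions, the function names, and the choice of k = 2 are all made up for illustration; the 2-d plotting step is left out, and for 0.5 million documents you would more likely reach for a library such as scikit-learn (`TfidfVectorizer`, `TruncatedSVD`, `KMeans`).

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one L2-normalised sparse TF-IDF vector (a dict) per document."""
    counts_per_doc = [Counter(tokenize(d)) for d in docs]
    doc_freq = Counter(term for counts in counts_per_doc for term in counts)
    n = len(docs)
    vectors = []
    for counts in counts_per_doc:
        # Smoothed IDF so that terms present in every document keep weight 1.
        vec = {t: tf * (math.log((1 + n) / (1 + doc_freq[t])) + 1.0)
               for t, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def mean_vector(vectors):
    """Normalised sum of the given vectors, used as a cluster centroid."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {t: w / norm for t, w in total.items()}


def kmeans(vectors, k, iterations=20):
    """Cosine k-means with deterministic farthest-first initialisation."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Add the vector that is least similar to every centroid chosen so far.
        centroids.append(
            min(vectors, key=lambda v: max(cosine(v, c) for c in centroids)))
    labels = None
    for _ in range(iterations):
        new_labels = [max(range(k), key=lambda j: cosine(v, centroids[j]))
                      for v in vectors]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = mean_vector(members)
    return labels


# Hypothetical retailer descriptions, not taken from the real data set.
DESCRIPTIONS = [
    "An online retailer providing goods and services including books, hardware, music",
    "Online bookstore selling books, music and electronics",
    "Family restaurant serving pizza, pasta and salads",
    "Fast food restaurant serving burgers and fries",
]

cluster_labels = kmeans(tfidf_vectors(DESCRIPTIONS), k=2)
for label, text in zip(cluster_labels, DESCRIPTIONS):
    print(label, text)
```

The cluster numbers carry no meaning by themselves; you still have to look at a few members of each cluster to decide what, if anything, it represents.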

2) How would you do this if you were given predefined categories such as "restaurant", "entertainment", etc.?

You'd manually label a bunch of them, then train a classifier on those and see how well it works, using the usual machinery of accuracy/F1, cross-validation, etc. Or you'd check whether a clustering algorithm picks up these categories well, but then you still need some labeled data to verify that.
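As a rough illustration of the label-then-train workflow, the sketch below fits a small multinomial Naive Bayes classifier on a handful of hand-labeled descriptions and estimates accuracy with k-fold cross-validation, again using only the standard library. Naive Bayes is just one reasonable choice, not something the answer prescribes, and the labeled examples and function names are invented for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def train_nb(texts, labels):
    """Collect the counts a multinomial Naive Bayes model needs."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def predict_nb(model, text):
    """Return the class with the highest log-posterior for the text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for w in tokenize(text):
            if w in vocab:  # ignore words never seen during training
                # Laplace (add-one) smoothing avoids log(0).
                score += math.log((word_counts[label][w] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_val_accuracy(texts, labels, folds=3):
    """Average accuracy over `folds` interleaved train/test splits."""
    correct = 0
    for f in range(folds):
        train = [i for i in range(len(texts)) if i % folds != f]
        test = [i for i in range(len(texts)) if i % folds == f]
        model = train_nb([texts[i] for i in train], [labels[i] for i in train])
        correct += sum(predict_nb(model, texts[i]) == labels[i] for i in test)
    return correct / len(texts)


# Hypothetical hand-labeled examples, not taken from the real data set.
LABELED = [
    ("Family restaurant serving pizza and pasta", "restaurant"),
    ("Movie theater showing films and selling popcorn", "entertainment"),
    ("Fast food restaurant serving burgers and fries", "restaurant"),
    ("Concert hall hosting live music and theater shows", "entertainment"),
    ("Steakhouse restaurant serving steaks and wine", "restaurant"),
    ("Amusement park with rides, games and live shows", "entertainment"),
]
texts = [t for t, _ in LABELED]
labels = [c for _, c in LABELED]

model = train_nb(texts, labels)
prediction = predict_nb(model, "Italian restaurant serving pasta and wine")
accuracy = cross_val_accuracy(texts, labels, folds=3)
print(prediction, accuracy)
```

With only six examples the cross-validation number is not meaningful; the point is the shape of the loop. On real data you would label at least a few hundred entries per category and also look at per-class precision and recall, since transaction categories tend to be unbalanced.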
