How can I find which category an offer belongs to using only its title?

I am developing a new service that will query multiple offer sites (Groupon, etc.), and I would like to determine which category each offer belongs to.

Example:

I get this title: "Acqualina Wellness Expo – Acqualina Resort & Spa" and I need to find out which category this offer belongs to.

I tried playing with http://www.google.com/insights/search/, but it's not easy because it accepts only 7 parameters (terms), and sometimes we have compound words that cannot be separated.


There are fun methods based on WordNet, search distance, and such, but the standard way would be the Bayesian spam filter approach, with your offer categories taking the place of "spam" and "not spam".

Step 1: Construct an example set of titles (or titles and bodies), each labeled with the category you think it belongs to. The larger and more diverse you make this set, the better. You need many different examples from each category you want to be able to recognize (let's say at least a two-digit number, but preferably hundreds). If you want help constructing this set, you could use Amazon's Mechanical Turk and pay other people to do the categorization.

Step 2: Train a classifier by running your examples through CRM114 (http://crm114.sourceforge.net/ ) or something similar. If you want to use a cloud service, I think the Google Prediction API allows for text fields.

Step 3: For testing, don't let the categorizer see all the examples. Keep some in what is called an out-of-sample set, which you can test your categorizer on. It is much easier for it to categorize titles it has already seen, so you want to know how good it is on unseen examples. Some categorizers will do this test for you automatically.
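
To make the three steps concrete, here is a minimal sketch in plain Python of a naive Bayes title categorizer with an out-of-sample check. It is not CRM114 or the Google Prediction API, just a small stand-in for the same idea. The category names, the example titles, and the names tokenize, NaiveBayesTitleClassifier and accuracy are all made up for illustration, and a real training set would need far more examples, as described in Step 1.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(title):
    """Lowercase a title and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", title.lower())


class NaiveBayesTitleClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of titles
        self.vocabulary = set()

    def train(self, examples):
        """Step 2: learn word frequencies from (title, category) pairs."""
        for title, category in examples:
            self.doc_counts[category] += 1
            for word in tokenize(title):
                self.word_counts[category][word] += 1
                self.vocabulary.add(word)

    def classify(self, title):
        """Return the category with the highest log-probability score."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        words = tokenize(title)
        best_category, best_score = None, float("-inf")
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)  # prior for the category
            for word in words:
                # Smoothing keeps unseen words from zeroing out the score.
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_category, best_score = category, score
        return best_category


def accuracy(classifier, examples):
    """Step 3: fraction of labeled examples the classifier gets right."""
    correct = sum(1 for title, cat in examples if classifier.classify(title) == cat)
    return correct / len(examples)


# Step 1: hypothetical hand-labeled examples (a real set needs many more).
TRAIN = [
    ("Relaxing massage and spa day", "wellness"),
    ("Full body massage at luxury spa resort", "wellness"),
    ("Yoga and wellness retreat weekend", "wellness"),
    ("Three course dinner for two", "food"),
    ("Sushi dinner and drinks", "food"),
    ("Pizza and pasta lunch buffet", "food"),
]
# Out-of-sample set: never shown to the classifier during training.
HELD_OUT = [
    ("Spa and wellness package", "wellness"),
    ("Dinner buffet for two", "food"),
]

classifier = NaiveBayesTitleClassifier()
classifier.train(TRAIN)

if __name__ == "__main__":
    print("held-out accuracy:", accuracy(classifier, HELD_OUT))
    print(classifier.classify("Acqualina Wellness Expo – Acqualina Resort & Spa"))
```

With this toy data the title from the question lands in the made-up "wellness" category because "wellness", "resort" and "spa" all appear in the wellness training titles. On real offers the held-out accuracy is the number to watch: if it is much lower than the accuracy on the training examples, the example set is probably too small or too uniform.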

Good luck!
