
Where can I find texts that describe topic-specific events?

Some background: I'm trying to train an ML system to answer questions about events, where both the event descriptions and the questions are posed in natural language; the event descriptions are constrained to single sentences.

So far the main problem with this has been locating a corpus that describes events with a limited enough vocabulary to pose similar questions across all of the events (e.g. if all of the events involved chess, I could reasonably ask 'what piece moved?' and an answer could be drawn from a decent percentage of the event description sentences).
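To make "a decent percentage" concrete, here is a minimal sketch of how I'd measure it. The function name `question_coverage`, the sample sentences, and the list of piece names are all made up for illustration; it just counts the fraction of sentences that mention at least one candidate answer to a given question.

```python
import re

def question_coverage(sentences, answer_terms):
    """Return the fraction of sentences mentioning at least one answer term."""
    terms = {t.lower() for t in answer_terms}
    if not sentences:
        return 0.0
    hits = 0
    for sentence in sentences:
        # Crude tokenization: lowercase alphabetic runs only.
        tokens = set(re.findall(r"[a-z]+", sentence.lower()))
        if tokens & terms:
            hits += 1
    return hits / len(sentences)

# Hypothetical chess-commentary sentences and candidate answers
# to the question "what piece moved?".
sentences = [
    "The knight moves to f3.",
    "White castles kingside.",
    "The bishop takes on c6.",
    "Black is thinking for a long time.",
]
pieces = ["pawn", "knight", "bishop", "rook", "queen", "king"]
print(question_coverage(sentences, pieces))
```

Whole-token matching is deliberate here: "kingside" does not count as a mention of "king". A real measure would probably need stemming or a better notion of "answerable", so treat this only as a rough screening heuristic.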

With that in mind, I'm hoping to find a text source that is tightly focused around describing events within some fairly limited topic (more along the lines of chess commentary than a chess forum, for example).

While I've had some luck with a corpus of air-traffic controller dialogs, most of the sentences aren't typical English (they involve a lot of Charlie, Tango, etc.). However, if the format is as I've described, then the actual topic of focus is irrelevant, so long as it has one.

Since I plan on building my own corpus out of this text, no tagging is necessary.


The Reuters corpus has fairly monotonous content (commercial news: CEO appointments, mergers and acquisitions, major deals, etc.). I am more familiar with the multilingual v2, but IIRC the v1 corpus was monolingual English. These will be multiple-sentence news stories, but in keeping with journalistic conventions, you can expect the first sentence to form a reasonable gist of the full story. http://about.reuters.com/researchandstandards/corpus/
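Since you want single-sentence event descriptions, pulling out that first sentence could look roughly like the sketch below. The function name `lead_sentence` and the sample story are invented for illustration, and the sentence boundary rule is a deliberately naive assumption (end punctuation, whitespace, then a capital letter), not anything specific to the Reuters data format.

```python
import re

# Naive boundary: '.', '!' or '?' followed by whitespace and an uppercase letter.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def lead_sentence(story):
    """Return the first sentence of a story using a naive regex split."""
    # Collapse newlines and repeated spaces so wrapped text splits cleanly.
    text = " ".join(story.split())
    if not text:
        return ""
    return _SENTENCE_END.split(text, maxsplit=1)[0]

# Hypothetical story text, not taken from the corpus.
story = (
    "Acme Corp named Jane Doe as chief executive on Monday.\n"
    "She replaces John Smith, who is retiring."
)
print(lead_sentence(story))
```

This will split wrongly on abbreviations such as "Inc." followed by a capitalized word, so for real data a trained sentence tokenizer is likely a better choice than this regex.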

You might also look at other TREC and especially MUC competition materials; http://en.wikipedia.org/wiki/Message_Understanding_Conference


Have you considered Usenet? It has a bunch of idiosyncratic conventions of its own but something like rec.food.cooking would seem to broadly fit your description. http://groups.google.com/group/rec.food.cooking/ Have a look at e.g. rec.sports.hockey or rec.games.video.arcade as well. There is also the 20 Newsgroups corpus if you are looking for a canonical, well-known corpus, and it contains at least some sports-related newsgroup material. http://people.csail.mit.edu/jrennie/20Newsgroups/

(Maybe in your country the "general public" is comfortable with baseball. Over here it would be football, you know, the kind where you can't use your hands.)
