Where can I find texts that describe topic-specific events?
So, some background: I'm trying to train a ML system to answer questions about events, where both the event descriptions and questions are posed in natural language; the event descriptions are constrained to being single sentences.
So far the main problem with this has been locating a corpus that describes events with a limited enough vocabulary to pose similar questions across all of the events (e.g. if all of the events involved chess, I could reasonably ask 'what piece moved?' and an answer could be drawn from a decent percentage of the event description sentences).
With that in mind, I'm hoping to find a text source that is tightly focused around describing events within some fairly limited topic (more 开发者_JAVA技巧along the lines of chess commentary than a chess forum, for example).
While I've had some luck with a corpus of air-traffic controller dialogs, most of sentences aren't typical English (they involve a lot of Charlie, Tango, etc.). However, if the format is as I've described then the actual topic of focus is irrelevant, so long as it has one.
Since I plan on building my own corpus out of this text, no tagging is necessary.
The Reuters corpus has a fairly monotonous content (commercial news; CEO appointments, mergers and acquisitions, major deals, etc); I am more familiar with the multilingual v2 but IIRC the v1 corpus was monolingual English. These will be multiple-sentence news stories, but in keeping with journalistic conventions, you can expect the first sentence to form a reasonable gist of the full story. http://about.reuters.com/researchandstandards/corpus/
You might also look at other TREC and especially MUC competition materials; http://en.wikipedia.org/wiki/Message_Understanding_Conference
Have you considered Usenet? It has a bunch of idiosyncratic conventions of its own but something like rec.food.cooking
would seem to broadly fit your description. http://groups.google.com/group/rec.food.cooking/ Have a look at e.g. rec.sports.hockey
or rec.games.video.arcade
as well. There is also the 20 Newsgroups corpus if you are looking for a canonical, well-known corpus, and it contains at least some sports-related newsgroup material. http://people.csail.mit.edu/jrennie/20Newsgroups/
(Maybe in your country the "general public" is comfortable with baseball. Over here it would be football, you know, the kind where you can't use your hands.)
精彩评论