Google Summer of Code: web classification dataset
I heard that Google hosted (or will host) a web classification competition and provided a large dataset (170k+ documents) of web sites classified into multiple categories (sports, computers, science, etc.). I looked through their Summer of Code web site for 2009 through 2011 but didn't find anything. Does anybody know where I can get that dataset?
I think I found it (although I'm not sure if the data was provided by Google): the ECML/PKDD 2010 Discovery Challenge Data Set. It contains 22 training labels (i.e. labels about the content), URLs and hyperlinks, content-based and link-based web spam features, term frequencies, and Natural Language Processing features.