how to get category among words in wikipedia?
i've problem about extracting category among words. i have several words in a cluster ("apple","iMac","snowleopard") and i would like to retrieve catego开发者_如何转开发ry among that words.
("apple","iMac","snowleopard") --> "Mac OS X"
i've tried using lexical database such as WordNet, but it won't work. i've been looking for other methods and found that wikipedia may help. any java library for wikipedia? and how to do such task i've mentioned above? Thanks
You can try using Wikipedia to extract some meaning from these terms. For example, the following query against the Wikipedia API:
http://en.wikipedia.org/w/api.php?action=query&prop=categories&format=json&clshow=!hidden&cllimit=10&generator=search&gsrsearch=apple%20iMac%20snowleopard%22&gsrnamespace=0&gsrprop=titlesnippet&gsrredirects=&gsrlimit=10
Yields the following result:
{
"query": {
"searchinfo": {
"totalhits": 3,
"suggestion": "apple iMac snow leopard\"\""
},
"pages": {
"2020710": {
"pageid": 2020710,
"ns": 0,
"title": "Apple's transition to Intel processors",
"categories": [
{
"ns": 14,
"title": "Category:Apple Inc."
},
{
"ns": 14,
"title": "Category:Intel Corporation"
},
{
"ns": 14,
"title": "Category:Mac OS X"
}
]
},
"14059031": {
"pageid": 14059031,
"ns": 0,
"title": "Mac OS X Snow Leopard",
"categories": [
{
"ns": 14,
"title": "Category:2009 software"
},
{
"ns": 14,
"title": "Category:Mac OS X"
}
]
},
"20640": {
"pageid": 20640,
"ns": 0,
"title": "OS X",
"categories": [
{
"ns": 14,
"title": "Category:1999 software"
},
{
"ns": 14,
"title": "Category:Apple Inc. operating systems"
},
{
"ns": 14,
"title": "Category:Apple Inc. software"
},
{
"ns": 14,
"title": "Category:Mac OS X"
},
{
"ns": 14,
"title": "Category:Mach"
}
]
}
}
},
"query-continue": {
"categories": {
"clcontinue": "14059031|X86-64 operating systems"
}
}
}
May not be easy to determine from this data what is the "correct" category, but it's a start.
精彩评论