Extracting all nouns and adjectives from a text via the Stanford parser
I'm trying to extract all nouns and adjectives from a given text via the Stanford parser.
My current attempt uses pattern matching on the output of the Tree object's getChildrenAsList() to locate things like:
(NN paper), (NN algorithm), (NN information), ...
and saving them in an array.
Input sentence:
In this paper we present an algorithm that extracts semantic information from an arbitrary text.
Result - String:
[(S (PP (IN In) (NP (DT this) (NN paper))) (NP (PRP we)) (VP (VBP present) (NP (NP (DT an) (NN algorithm)) (SBAR (WHNP (WDT that)) (S (VP (VBD extracts) (NP (JJ semantic) (NN information)) (PP (IN from) (NP (DT an) (ADJP (JJ arbitrary)) (NN text)))))))) (. .))]
I resorted to pattern matching because I couldn't find a method in the Stanford parser that returns all words of a given word class, such as nouns.
Is there a better way to extract these word classes, or does the parser provide specific methods?
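For reference, the pattern matching described above can be done with a regular expression over the bracketed parse string; a minimal sketch in Python (the regex and variable names are illustrative, not part of the Stanford API):

```python
import re

# The bracketed parse string from the question
parse = ("(S (PP (IN In) (NP (DT this) (NN paper))) (NP (PRP we)) "
         "(VP (VBP present) (NP (NP (DT an) (NN algorithm)) "
         "(SBAR (WHNP (WDT that)) (S (VP (VBD extracts) "
         "(NP (JJ semantic) (NN information)) (PP (IN from) "
         "(NP (DT an) (ADJP (JJ arbitrary)) (NN text)))))))) (. .))")

# Match leaf nodes whose tag starts with NN or JJ, e.g. "(NN paper)"
pairs = re.findall(r"\((NN\w*|JJ\w*) ([^\s()]+)\)", parse)
words = [word for _, word in pairs]
print(words)  # → ['paper', 'algorithm', 'semantic', 'information', 'arbitrary', 'text']
```

This is brittle, though, which is why the answers below suggest asking the parser (or a tagger) for tagged words directly instead of scraping the string representation.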
public static void main(String[] args) {
    String str = "In this paper we present an algorithm that extracts semantic information from an arbitrary text.";
    LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");
    Tree parseS = (Tree) lp.apply(str);
    System.out.println("parseS.getChildrenAsList(): " + parseS.getChildrenAsList());
}
BTW, if all you want are parts of speech like nouns and verbs, you should just use a part of speech tagger, such as the Stanford POS tagger. It'll run a couple of orders of magnitude more quickly and be at least as accurate.
But you can do it with the parser. The method you want is taggedYield(), which returns a List<TaggedWord>. So you have:
List<TaggedWord> taggedWords = ((Tree) lp.apply(str)).taggedYield();
for (TaggedWord tw : taggedWords) {
    if (tw.tag().startsWith("N") || tw.tag().startsWith("J")) {
        System.out.printf("%s/%s%n", tw.word(), tw.tag());
    }
}
(This method cuts a corner, knowing that all and only adjective and noun tags start with J or N in the Penn treebank tag set. You could more generally check for membership in a set of tags.)
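The set-membership check mentioned above could look like the following; a minimal sketch in Python, where the tag set and the sample tagged pairs are illustrative and not part of any Stanford API:

```python
# Explicit Penn Treebank noun and adjective tags, rather than
# relying on the first-letter shortcut (N... / J...).
NOUN_ADJ_TAGS = {"NN", "NNS", "NNP", "NNPS", "JJ", "JJR", "JJS"}

def filter_by_tags(tagged_words, tags=NOUN_ADJ_TAGS):
    """Keep only the (word, tag) pairs whose tag is in the given set."""
    return [(word, tag) for word, tag in tagged_words if tag in tags]

# Hypothetical tagger output for part of the example sentence
tagged = [("In", "IN"), ("this", "DT"), ("paper", "NN"), ("we", "PRP"),
          ("present", "VBP"), ("an", "DT"), ("algorithm", "NN")]
print(filter_by_tags(tagged))  # → [('paper', 'NN'), ('algorithm', 'NN')]
```

An explicit set also makes it easy to target other classes later (say, verbs via VB, VBD, VBG, VBN, VBP, VBZ) without changing the filtering logic.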
P.S. The tag stanford-nlp is best for questions about Stanford NLP tools on Stack Overflow.
I'm sure you're aware of NLTK (the Natural Language Toolkit). Just install this Python library along with the maxent POS tagger, and the following code should do the trick. The tagger has been trained on the Penn Treebank, so the tags are the same. It doesn't use the Stanford parser, but I love NLTK, hence:
import nltk

nouns = []
adj = []

# read the text into the variable "text"
tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)
for word, tag in tagged:
    if tag.startswith("N"):
        nouns.append(word)
    elif tag.startswith("J"):
        adj.append(word)
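The same filtering can be written with list comprehensions; a minimal sketch assuming `tagged` already holds (word, tag) pairs in the NLTK/Penn style (the sample data below is hypothetical, standing in for real nltk.pos_tag output):

```python
# Hypothetical tagger output in (word, tag) form, as nltk.pos_tag would return it
tagged = [("In", "IN"), ("this", "DT"), ("paper", "NN"), ("we", "PRP"),
          ("present", "VBP"), ("an", "DT"), ("algorithm", "NN"),
          ("semantic", "JJ"), ("information", "NN")]

nouns = [word for word, tag in tagged if tag.startswith("N")]
adj = [word for word, tag in tagged if tag.startswith("J")]

print(nouns)  # → ['paper', 'algorithm', 'information']
print(adj)    # → ['semantic']
```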