Is there a way to chunk 2 or more repetitions of a tag in a tagged sentence using nltk?

I'm trying to use the nltk module in Python to chunk together any instances where two to five nouns occur in sequence.

This is the code I am using:

parse_pattern  = "Keyword: {< N>{2,5}}"
keyword_parser = nltk.RegexpParser(parse_pattern)
result = keyword_parser.parse(sentence)

It makes sense to me that this bit should do the trick: Keyword: {< N>{2,5}}

I even found an example in the book Natural Language Processing with Python that uses a completely analogous pattern: NOUNS: {< N.*>{4,}}, where the authors explain that it should chunk four or more nouns.

However, I get an error when I run the above code:

ValueError: Illegal chunk pattern: {< N>{2,5}}

Note: I also tried the above using {< N.*>{2,5}} (with the dot star solely because the author of the aforementioned book did) with no luck.

Any help with how to chunk two or more repetitions of a tag would be highly appreciated.


The ValueError is probably triggered by the space between the opening angle bracket and the N. Try

parse_pattern = "Keyword: {<N>{2,5}}"

rather than

parse_pattern = "Keyword: {< N>{2,5}}"

Also, don't worry about using the syntax with the extra dot star; it is only necessary if you are trying to match all tags that start with a given prefix, here N.
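To illustrate the difference between an exact tag and the dot-star form, here is a minimal sketch. It assumes a reasonably recent NLTK release (one that accepts {min,max} quantifiers in tag patterns); the sample sentence and the helper name keyword_chunks are made up for illustration and are not from the question.

```python
import nltk

# Hypothetical pre-tagged sentence using Penn Treebank style tags.
tagged = [("the", "DT"), ("stock", "NN"), ("market", "NN"),
          ("analysts", "NNS"), ("panicked", "VBD")]


def keyword_chunks(grammar, tagged_sentence):
    """Return the words of every 'Keyword' chunk found by the grammar."""
    tree = nltk.RegexpParser(grammar).parse(tagged_sentence)
    return [[word for word, _tag in subtree.leaves()]
            for subtree in tree.subtrees()
            if subtree.label() == "Keyword"]


# <N> only matches a tag that is literally "N".
exact = keyword_chunks("Keyword: {<N>{2,5}}", tagged)
# <N.*> matches any tag that starts with "N" (NN, NNS, NNP, ...).
wildcard = keyword_chunks("Keyword: {<N.*>{2,5}}", tagged)

print(exact)
print(wildcard)
```

With tags like NN and NNS in the sentence, the exact pattern finds no chunk at all, while the dot-star pattern groups the three nouns together.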

If all else fails, you may try an alternative expression that doesn't require the {min,max} syntax for the range of occurrences:

parse_pattern = "Keyword: {<N><N><N>?<N>?<N>?}"

And if even that fails, try just parse_pattern = "Keyword: {<N>}". This should hopefully get something working, or at least help pinpoint what else may be wrong with your setup.


nltk tags nouns with the following tags:

  • <NN> for a singular noun
  • <NNP> for a singular proper noun
  • <NNS> for a plural noun
  • <NNPS> for a plural proper noun

Thus if you want to catch any of these between two and five times, you'll want the regex:

<NN.*>{2,5}

With your example, that would be:

parse_pattern  = "Keyword: {<NN.*>{2,5}}"
keyword_parser = nltk.RegexpParser(parse_pattern)
result         = keyword_parser.parse(sentence)

Note that sentence must be tagged, e.g.

sentence = [("dog", "NN"), ("David", "NNP"), ("cats", "NNS")]
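As a rough check of the two-to-five bound, the sketch below feeds the parser a made-up run of seven nouns followed by a verb and a lone noun. It assumes a recent NLTK release; the words and the variable names are hypothetical.

```python
import nltk

# Hypothetical input: seven nouns in a row, a verb, then a single noun.
tagged = [("noun%d" % i, "NN") for i in range(1, 8)]
tagged += [("runs", "VBZ"), ("alone", "NN")]

keyword_parser = nltk.RegexpParser("Keyword: {<NN.*>{2,5}}")
result = keyword_parser.parse(tagged)

# Collect the size of each Keyword chunk, in sentence order.
chunk_sizes = [len(subtree.leaves())
               for subtree in result.subtrees()
               if subtree.label() == "Keyword"]
print(chunk_sizes)
```

The run of seven nouns is split into a chunk of five and a chunk of two, and the lone noun after the verb is left unchunked because it falls below the minimum of two.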


Look at the code of the regexp.py module (nltk.chunk.regexp), specifically the function tag_pattern2re_pattern(), which converts a tag pattern into a proper regular expression. The pattern is checked against the constant CHUNK_TAG_PATTERN, which is fixed and only accepts patterns that start and end with certain special characters, such as '(', ' ', '<', ')' and '>'. So the tag pattern CHUNK: {<V.*><TO><V.*>} is correct, but the tag pattern CHUNK: {<V>.*<TO><V.*>{1,}} is incorrect.
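One way to inspect this yourself is to call the conversion function directly. The snippet below is only a sketch: try_convert is a hypothetical helper, and which patterns are accepted may differ between NLTK versions, so it reports the outcome instead of assuming it.

```python
from nltk.chunk.regexp import tag_pattern2re_pattern


def try_convert(tag_pattern):
    """Return the generated regular expression, or a note if NLTK rejects the pattern."""
    try:
        return tag_pattern2re_pattern(tag_pattern)
    except ValueError as err:
        return "rejected: %s" % err


# The two tag patterns discussed above (without the surrounding braces).
for pattern in ["<V.*><TO><V.*>", "<V>.*<TO><V.*>{1,}"]:
    print(pattern, "->", try_convert(pattern))
```

If a pattern is rejected here, RegexpParser will report it as an illegal chunk pattern as well, so this is a quick way to test a pattern in isolation.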
