开发者

I am trying to determine if a string is a Question. How can I analyze the "?" symbol (python)

This is a question:

"Where is the car?" 

This is NOT a question:

"Check this out: http://domain.com/?q=test"

How do I write a function to analyze a string so that we know for sure it is a 开发者_JAVA技巧question and not part of a URL?


This regex finds question marks following a word character, and followed by either whitespace or the end of the string/line. Not perfect, but should catch most cases...

\w\?[$\s]

Edit (lack of caffeine strikes...):

That should have been:

\w\?(\s|$)

In the original, $ is interpreted as a literal character. (Thanks Gumbo)


If question mark is always there, you could check like

if question.strip().endswith("?") and "://" not in question:
    # do something ?

If you really want to parse real sentence, you may need nltk, I am not sure for that case.

p.s this is just an sample if the text is fixed, nobody can parse real English grammar with regex.


Essentially what others say is correct. There should be no whitespace before the ?. If the question is entered by a user, things get more ambiguous however.

In that case a proper parser using a context free grammar may yield better results. Even with questions not having a question mark at the end. But it may not recognize all questions. Covering all possible structure variations, inflections and whatnot is not straight-forward.

But, if you are certain that the questions always end with a question mark, you could do something as simple as

if question_text.strip().endswith("?"):
    print `question_text`, "is a question"

Or:

import re
p = re.compile( r"\w+\?\s*" )
if p.search( question_text ):
    print `question_text`, "contains a question"

Not tested, but should work for most cases.


You can for example check if the question mark is immediately followed by a non-space, non-line break character. But I guess that a more safe way would be to strip any possible URL from the string before searching the question mark on it.


The question mark will not have white space either side or a line break/end-of-string after it, if it is in a url


A probably not very robust approach that you might be able to get some traction with would be to look for "question words" in strings that end with question marks. In English, most question sentences or clauses (i.e. following a comma) start with "who", "what", "where", "when", "how", "why", "can", "may", "will", "won't, "does", "doesn't", etc. You could probably build up a pretty good heuristic this way that might work better than a regex (or could be incorporated into one or more regexes).

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜