Issue in string matching in Python

I am trying to read from a file and match a certain combination of strings. My program is below:

def negative_verbs_features(filename):

    # Open and read the file content
    file = open (filename, "r")
    text = file.read()
    file.close()

    # Create a list of negative verbs from the MPQA lexicon
    file_negative_mpqa = open("../data/PolarLexicons/negative_mpqa.txt", "r")
    negative_verbs = []
    for line in file_negative_mpqa:
        #print line,
        pos, word = line.split(",")
        #print line.split(",")      
        if pos == "verb":
            negative_verbs.append(word)
    return negative_verbs

if __name__ == "__main__":
    print negative_verbs_features("../data/test.txt")

The file negative_mpqa.txt consists of word, part-of-speech tag pairs separated by a comma (,). Here's a snippet of the file:

abandoned,adj
abandonment,noun
abandon,verb
abasement,anypos
abase,verb
abash,verb
abate,verb
abdicate,verb
aberration,adj
aberration,noun

I would like to create a list of all words in the file that have verb as their part-of-speech. However, when I run my program, the returned list (negative_verbs) is always empty; the body of the if statement never executes. I tried printing each word, pos pair by uncommenting the line print line.split(","). Here is a snippet of the output:

['wrongful', 'adj\r\n']
['wrongly', 'anypos\r\n']
['wrought', 'adj\r\n']
['wrought', 'noun\r\n']
['yawn', 'noun\r\n']
['yawn', 'verb\r\n']
['yelp', 'verb\r\n']
['zealot', 'noun\r\n']
['zealous', 'adj\r\n']
['zealously', 'anypos\r\n']

I understand that my file may have special characters, such as a carriage return and a newline, at the end of every line. I just want to ignore them and build my list. Please let me know how to proceed.

PS: I am a newbie in Python.


You said the file has lines like this: abandoned,adj so those are word, pos pairs. But you wrote pos, word = line.split(",") which means that pos == 'abandoned' and word == 'adj' ... I think it's clear why the list will be empty now :-)
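
To see the mix-up concretely, here is a minimal sketch using a single line in the format shown in the question. The name sample_line is made up for illustration; the line ending is the one visible in the printed output.

```python
# One line in the same "word,pos" format as negative_mpqa.txt,
# including the "\r\n" ending seen in the question's printed output.
sample_line = "abandon,verb\r\n"

# Unpacking in the order used in the question puts the word into `pos`
# and the part-of-speech tag (plus the line ending) into `word`.
pos, word = sample_line.split(",")

# The comparison from the question therefore never succeeds.
swapped_match = (pos == "verb")  # pos is actually "abandon"
```

Note that swapping the names alone would still leave the tag as "verb\r\n", so the trailing line ending also has to be removed before comparing.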


Replace the line pos, word = line.split(",") with

word, pos = line.rstrip().split(",")

rstrip() removes the whitespace characters (spaces, newlines, carriage returns...) from the right end of your string. Note that lstrip() and strip() also exist. You also had word and pos swapped!
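
As a quick illustration of how the three methods differ, here is a small sketch on a made-up string (raw is a hypothetical value, not taken from the lexicon file):

```python
# A hypothetical value with leading spaces and a Windows-style line ending.
raw = "  verb\r\n"

right_only = raw.rstrip()  # strips trailing whitespace only
left_only = raw.lstrip()   # strips leading whitespace only
both_sides = raw.strip()   # strips both ends
```

Called with no arguments, all three treat spaces, tabs, "\r" and "\n" as whitespace, which is why rstrip() is enough to clean up the line endings here.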

You could also split first and call rstrip() on the second field alone (the part-of-speech tag, which is the piece carrying the trailing '\r\n') before comparing it with "verb".
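
Putting both fixes together, the parsing loop might look like the sketch below. The function name extract_negative_verbs and the sample data are made up for illustration, and the code assumes every non-blank line contains exactly one comma, as in the snippet from the question.

```python
def extract_negative_verbs(lines):
    """Return the words tagged "verb" from an iterable of "word,pos" lines."""
    negative_verbs = []
    for line in lines:
        line = line.strip()  # drop "\r\n" and any surrounding spaces
        if not line:
            continue  # skip blank lines
        # Assumption: exactly one comma per line, word first, tag second.
        word, pos = line.split(",")
        if pos == "verb":
            negative_verbs.append(word)
    return negative_verbs


# Sample lines copied from the question's snippet, with "\r\n" endings added.
sample_lines = [
    "abandoned,adj\r\n",
    "abandon,verb\r\n",
    "abase,verb\r\n",
    "aberration,noun\r\n",
]
verbs = extract_negative_verbs(sample_lines)
```

Because the function accepts any iterable of lines, an open file object can be passed to it directly in place of the list.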
