word count problem

I want to count words in text files that contain data like the following:

ROK :
    ROK/(NN)
New :
    New/(SV)
releases, :
    releases/(NN) + ,/(SY)
week :
    week/(EP)
last :
    last/(JO)
compared :
    compare/(VV) + -ed/(EM)
year :
    year/(DT)
releases :
    releases/(NN)

Expressions such as /(NN), /(SV), and /(EP) are categories. I want to extract the word just before each category and count how many times each word occurs in the whole text.

I want to write the result to a new text file like this:

(NN)
releases 2
ROK 1

(SV)
New 1

(SY)
, 1

(EP)
week 1

(JO)
last 1

......

Please help me out!

Here is my garbage code ;_; It doesn't work.

import os, sys
import re

wordset = {}
for line in open('E:\\mach.txt', 'r'):
    if '/(' in line:
        word = re.findall(r'(\w)/\(', line)
        print word
        if word not in wordset: wordset[word]=1
        else: wordset[word]+=1

f = open('result.txt', 'w')
for word in wordset:
    print>> f, word, wordset[word]
f.close()


from __future__ import print_function
import re


# Capture the word before the slash and the parenthesised category after it.
REGEXP = re.compile(r'(\w+)/(\(.*?\))')


def main():
    words = {}  # category -> {word: count}

    with open('E:\\mach.txt', 'r') as fp:
        for line in fp:
            for item, category in REGEXP.findall(line):
                words.setdefault(category, {}).setdefault(item, 0)
                words[category][item] += 1

    with open('result.txt', 'w') as fp:
        # Categories are written in sorted order, one block per category.
        for category, counts in sorted(words.items()):
            print(category, file=fp)
            for word, count in counts.items():
                print(word, count, sep=' ', file=fp)
            print(file=fp)
    return 0

if __name__ == '__main__':
    raise SystemExit(main())

You're welcome (= If you also want to count that weird "-ed" or ",", tune the regexp to match any run of non-whitespace characters:

REGEXP = re.compile(r'([^\s]+)/(\(.*?\))')
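
To see what the two patterns actually capture, here is a minimal sketch (Python 3) run against a few sample lines from the question. The helper name `count_by_category` and the use of `collections.Counter` are my own choices for illustration, not part of the answer above. `most_common()` is used so that words come out ordered by count, which seems to be what the desired output asks for.

```python
import re
from collections import Counter, defaultdict

WORD_ONLY = re.compile(r'(\w+)/(\(.*?\))')      # pattern from the answer above
NON_SPACE = re.compile(r'([^\s]+)/(\(.*?\))')   # tuned pattern: any non-whitespace

# Sample lines copied from the question.
SAMPLE = [
    "ROK :",
    "    ROK/(NN)",
    "releases, :",
    "    releases/(NN) + ,/(SY)",
    "compared :",
    "    compare/(VV) + -ed/(EM)",
    "releases :",
    "    releases/(NN)",
]


def count_by_category(lines, pattern):
    """Hypothetical helper: map each category to a Counter of its words."""
    counts = defaultdict(Counter)
    for line in lines:
        for word, category in pattern.findall(line):
            counts[category][word] += 1
    return counts


strict = count_by_category(SAMPLE, WORD_ONLY)
loose = count_by_category(SAMPLE, NON_SPACE)

for category, counter in sorted(loose.items()):
    print(category)
    for word, n in counter.most_common():   # highest count first
        print(word, n)
    print()
```

On these lines the `\w+` pattern skips `,/(SY)` entirely and records `-ed/(EM)` as `ed`, while the non-whitespace pattern keeps `,` and `-ed` intact.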


You're trying to use a list (yes, word is a list, because re.findall returns one) as a dictionary key. Here is what you should do:

from __future__ import print_function
import re

wordset = {}
for line in open('testdata.txt', 'r'):
    if '/(' in line:
        # findall returns a list; \w+ captures the whole word before "/("
        words = re.findall(r'(\w+)/\(', line)
        print(words)
        for word in words:
            if word not in wordset:
                wordset[word] = 1
            else:
                wordset[word] += 1

f = open('result.txt', 'w')
for word in wordset:
    print(word, wordset[word], file=f)
f.close()

You're lucky I want to learn Python, otherwise I wouldn't have tried your code. Next time, post the error you're getting! I bet it was:

TypeError: unhashable type: 'list'

It's important to help us help you if you want good answers!
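
For anyone who hits the same error, here is a small sketch of why it happens (Python 3 syntax; the variable names are mine). `re.findall` always returns a list, and a list cannot be used as a dictionary key because it is not hashable. The strings inside the list are hashable, so looping over them works.

```python
import re

line = "    releases/(NN) + ,/(SY)"   # one sample line from the question
found = re.findall(r'(\w+)/\(', line)
print(type(found).__name__, found)

wordset = {}
try:
    wordset[found] = 1          # a list used as a dict key
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)

# Looping over the list and using each string as the key works.
for word in found:
    wordset[word] = wordset.get(word, 0) + 1
print(wordset)
```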
