
Extract items from n-line chunks in a file, count frequency of items for each chunk, Python

I have a text file containing 5-line chunks of tab-delimited lines:

 1 \t DESCRIPTION \t SENTENCE \t ITEMS
 1 \t DESCRIPTION \t SENTENCE \t ITEMS
 1 \t DESCRIPTION \t SENTENCE \t ITEMS
 1 \t DESCRIPTION \t SENTENCE \t ITEMS
 1 \t DESCRIPTION \t SENTENCE \t ITEMS
 2 \t DESCRIPTION \t SENTENCE \t ITEMS
 2 \t DESCRIPTION \t SENTENCE \t ITEMS
 2 \t DESCRIPTION \t SENTENCE \t ITEMS
 2 \t DESCRIPTION \t SENTENCE \t ITEMS
 2 \t DESCRIPTION \t SENTENCE \t ITEMS

etc.

In each chunk, the DESCRIPTION and SENTENCE columns are the same. The data of interest is in the ITEMS column, which differs from line to line within a chunk and has the following format:

word1, word2, word3

...and so on

For each 5-line chunk, I need to count the frequency of word1, word2, etc. in ITEMS. For example, if the first 5-line chunk were as follows:

 1 \t DESCRIPTION \t SENTENCE \t word1, word2, word3
 1 \t DESCRIPTION \t SENTENCE \t word1, word2
 1 \t DESCRIPTION \t SENTENCE \t word4
 1 \t DESCRIPTION \t SENTENCE \t word1, word2, word3
 1 \t DESCRIPTION \t SENTENCE \t word1, word2

then the correct output for this 5-line chunk would be

1, SENTENCE, (word1: 4, word2: 4, word3: 2, word4: 1)

That is, the chunk number, followed by the sentence, followed by the frequency counts for the words.

I have some code to extract the five-line chunks and to count the frequency of words in a chunk once it's extracted, but am stuck on the task of isolating each chunk, getting the word frequencies, moving on to the next, etc.

from itertools import groupby

def GetFrequencies(file):
    file_contents = open(file).readlines()  # file as a list of lines
    # use zip to get the entire file as a list of 5-line chunk tuples
    five_line_increments = zip(*[iter(file_contents)]*5)
    for chunk in five_line_increments:  # for each 5-line chunk...
        for sentence in chunk:          # ...and for each sentence in that chunk
            words = sentence.split('\t')[3].split()  # get the ITEMS column at index 3
            words_no_comma = [x.strip(',') for x in words]  # get rid of the commas
            words_no_ws = [x.strip(' ') for x in words_no_comma]  # get rid of leftover whitespace

        # STUCK HERE: the idea originally was to take the words lists for each
        # chunk, combine them into one big list, 'collection', and feed that
        # into the for-loop below.

    for key, group in groupby(collection):  # collection is a big list of all the words in the chunk's ITEMS column, e.g. ['word1', 'word2', 'word3', 'word1', 'word1', 'word2', ...]
        print key, len(list(group)),


Using python 2.7
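One way to fill in the missing step while keeping the groupby approach is to collect each chunk's words into a single list, sort it (groupby only merges consecutive equal items), and count the group lengths. A minimal sketch, with the example 5-line chunk inlined as sample data:

```python
from itertools import groupby

# a 5-line chunk as produced by zip(*[iter(file_contents)]*5),
# inlined here as sample data for illustration
chunk = (
    "1\tDESCRIPTION\tSENTENCE\tword1, word2, word3\n",
    "1\tDESCRIPTION\tSENTENCE\tword1, word2\n",
    "1\tDESCRIPTION\tSENTENCE\tword4\n",
    "1\tDESCRIPTION\tSENTENCE\tword1, word2, word3\n",
    "1\tDESCRIPTION\tSENTENCE\tword1, word2\n",
)

collection = []  # all words in this chunk's ITEMS columns
for sentence in chunk:
    items = sentence.split('\t')[3]
    collection.extend(w.strip() for w in items.strip().split(','))

# groupby only groups *consecutive* equal items, so sort first
for key, group in groupby(sorted(collection)):
    print(key, len(list(group)))  # word1 4, word2 4, word3 2, word4 1
```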

#!/usr/bin/env python

import collections

chunks = {}

with open('input') as fd:
    for line in fd:
        line = line.split()
        if not line:
            continue
        if line[0] not in chunks:   # first line of a new chunk: remember the sentence
            chunks[line[0]] = [line[2]]
        for i in line[3:]:          # count this line's items, including the chunk's first line
            chunks[line[0]].append(i.replace(',', ''))

for k, v in chunks.iteritems():
    counter = collections.Counter(v[1:])
    print k, v[0], counter

Outputs:

1 SENTENCE Counter({'word1': 4, 'word2': 4, 'word3': 2, 'word4': 1})


There's a csv parser in the standard library that can handle the input splitting for you

import csv
import collections

def GetFrequencies(file_in):
    sentences = dict()
    with open(file_in, 'rb') as f:  # csv.reader objects are not context managers
        csv_file = csv.reader(f, delimiter='\t')
        for line in csv_file:
            sentence = line[0]      # line[0] is the chunk number, used as the grouping key
            if sentence not in sentences:
                sentences[sentence] = collections.Counter()
            sentences[sentence].update([x.strip(' ') for x in line[3].split(',')])
    return sentences
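Note that csv.reader accepts any iterable of lines, not just an open file, which makes the tab splitting easy to try out on inline sample data:

```python
import csv
import collections

# csv.reader works on any iterable of strings, so sample rows can be inlined
lines = [
    "1\tDESCRIPTION\tSENTENCE\tword1, word2, word3",
    "1\tDESCRIPTION\tSENTENCE\tword1, word2",
]

counts = collections.Counter()
for row in csv.reader(lines, delimiter='\t'):
    counts.update(x.strip(' ') for x in row[3].split(','))

print(counts)  # word1 and word2 counted twice, word3 once
```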


I edited your code a little bit; I think it does what you want it to do:

file_contents = open(file).readlines()  # file as a list of lines
# use zip to get the entire file as a list of 5-line chunk tuples
five_line_increments = zip(*[iter(file_contents)]*5)
for chunk in five_line_increments:  # for each 5-line chunk...
    word_freq = {}  # word frequencies for this chunk
    for sentence in chunk:          # ...and for each sentence in that chunk
        words = sentence.split('\t')[3].strip('\n').split(', ')  # get the ITEMS column at index 3 as a list of words
        for word in words:
            if word not in word_freq:
                word_freq[word] = 1
            else:
                word_freq[word] += 1

    print word_freq

Output:

{'word4': 1, 'word1': 4, 'word3': 2, 'word2': 4}
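The manual if/else bookkeeping can also be replaced with collections.Counter, which Python 2.7 already ships; a sketch of the same per-chunk count, with the example chunk inlined as sample data:

```python
import collections

# the example 5-line chunk, inlined as sample data
chunk = (
    "1\tDESCRIPTION\tSENTENCE\tword1, word2, word3\n",
    "1\tDESCRIPTION\tSENTENCE\tword1, word2\n",
    "1\tDESCRIPTION\tSENTENCE\tword4\n",
    "1\tDESCRIPTION\tSENTENCE\tword1, word2, word3\n",
    "1\tDESCRIPTION\tSENTENCE\tword1, word2\n",
)

word_freq = collections.Counter()
for sentence in chunk:
    # Counter.update adds counts instead of replacing them
    word_freq.update(sentence.split('\t')[3].strip('\n').split(', '))

print(dict(word_freq))  # {'word1': 4, 'word2': 4, 'word3': 2, 'word4': 1}
```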


To summarize: You want to append all "words" to a collection if they are not "DESCRIPTION" or "SENTENCE"? Try this:

for word in words_no_ws:
    if word not in ("DESCRIPTION", "SENTENCE"):
        collection.append(word)
