Extract items from n-line chunks in a file, count frequency of items for each chunk, Python

2023-03-29 21:06

I have a text file containing 5-line chunks of tab-delimited lines:

 1 \t DESCRIPTION \t SENTENCE \t ITEMS

 1 \t DESCRIPTION \t SENTENCE \t ITEMS

 1 \t DESCRIPTION \t SENTENCE \t ITEMS

 1 \t DESCRIPTION \t SENTENCE \t ITEMS

 1 \t DESCRIPTION \t SENTENCE \t ITEMS

 2 \t DESCRIPTION \t SENTENCE \t ITEMS

 2 \t DESCRIPTION \t SENTENCE \t ITEMS

 2 \t DESCRIPTION \t SENTENCE \t ITEMS

 2 \t DESCRIPTION \t SENTENCE \t ITEMS

 2 \t DESCRIPTION \t SENTENCE \t ITEMS

etc.

In each chunk, the DESCRIPTION and SENTENCE columns are the same. The data of interest is in the ITEMS column, which differs from line to line within the chunk and has the following format:

word1, word2, word3

...and so on

For each 5-line chunk, I need to count the frequency of word1, word2, etc. in ITEMS. For example, if the first 5-line chunk was as follows

 1 \t DESCRIPTION \t SENTENCE \t word1, word2, word3

 1 \t DESCRIPTION \t SENTENCE \t word1, word2

 1 \t DESCRIPTION \t SENTENCE \t word4

 1 \t DESCRIPTION \t SENTENCE \t word1, word2, word3

 1 \t DESCRIPTION \t SENTENCE \t word1, word2

then the correct output for this 5-line chunk would be

1, SENTENCE, (word1: 4, word2: 4, word3: 2, word4: 1)

I.e., the chunk number, followed by the sentence, followed by the frequency counts for the words.

I have some code to extract the five-line chunks and to count the frequency of words in a chunk once it's extracted, but am stuck on the task of isolating each chunk, getting the word frequencies, moving on to the next, etc.

from itertools import groupby 

def GetFrequencies(file):
    file_contents = open(file).readlines()  #file as list
    """use zip to get the entire file as list of 5-line chunk tuples""" 
    five_line_increments = zip(*[iter(file_contents)]*5) 
    for chunk in five_line_increments:  #for each 5-line chunk... 
        for sentence in chunk:          #...and for each sentence in that chunk
            words = sentence.split('\t')[3].split() #get the ITEMS column at index 3
            words_no_comma = [x.strip(',') for x in words]  #get rid of the commas
            words_no_ws = [x.strip(' ')for x in words_no_comma] #get rid of the whitespace resulting from the removed commas


       """STUCK HERE   The idea originally was to take the words lists for 
       each chunk and combine them to create a big list, 'collection,' and
       feed this into the for-loop below."""





    for key, group in groupby(collection): #collection is a big list containing all of the words in the ITEMS section of the chunk, e.g., ['word1', 'word2', 'word3', 'word1', 'word1', 'word2', etc.]
        print key,len(list(group)),

I'm using Python 2.7.

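#!/usr/bin/env python

import collections

chunks = {}  # chunk number -> [sentence, item, item, ...]

with open('input') as fd:
    for line in fd:
        # split on tabs only, so multi-word sentences stay in one field
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        number = fields[0].strip()
        if number not in chunks:
            chunks[number] = [fields[2].strip()]  # first slot holds the sentence
        # add the items from every line, including the first line of the chunk
        chunks[number].extend(w.strip() for w in fields[3].split(','))

for k, v in chunks.iteritems():
    counter = collections.Counter(v[1:])
    print k, v[0], counter

Outputs (for the example chunk):

1 SENTENCE Counter({'word1': 4, 'word2': 4, 'word3': 2, 'word4': 1})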
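
Because the lines of one chunk share the number in the first column and sit next to each other in the file, the grouping can also be done with itertools.groupby keyed on that column. The following is only a sketch under a few assumptions: the helper names parse_line and count_by_chunk_number are made up for illustration, every non-blank line is assumed to have at least four tab-separated fields, and the function takes any iterable of lines (an open file object would do).

```python
import collections
import itertools


def parse_line(line):
    """Split one tab-delimited line into (number, sentence, items)."""
    fields = [f.strip() for f in line.rstrip('\n').split('\t')]
    items = [w.strip() for w in fields[3].split(',') if w.strip()]
    return fields[0], fields[2], items


def count_by_chunk_number(lines):
    """Yield (number, sentence, Counter) for each run of lines sharing a number."""
    # Blank lines are skipped; malformed lines are assumed not to occur.
    rows = (parse_line(l) for l in lines if l.strip())
    for number, group in itertools.groupby(rows, key=lambda r: r[0]):
        counter = collections.Counter()
        sentence = None
        for _, sentence, items in group:
            counter.update(items)
        yield number, sentence, counter
```

Since nothing here depends on the chunk length, this version would also cope with chunks that are not exactly five lines long, as long as the lines of a chunk are contiguous.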

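There's a csv parser in the standard library that can handle the input splitting for you:

import csv
import collections

def GetFrequencies(file_in):
    sentences = dict()  # chunk number -> Counter of its items
    # csv.reader is not a context manager, so wrap the file object instead
    with open(file_in, 'rb') as f:  # binary mode, as the Python 2 csv module expects
        for line in csv.reader(f, delimiter='\t'):
            if len(line) < 4:
                continue  # skip blank or malformed rows
            number = line[0].strip()
            if number not in sentences:
                sentences[number] = collections.Counter()
            sentences[number].update([x.strip(' ') for x in line[3].split(',')])
    return sentences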
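
Once a Counter exists for a chunk, it still has to be rendered in the "number, sentence, (word: count, ...)" form the question asks for. Here is one possible way to do that; format_chunk is a hypothetical helper, and the sort order (highest count first, ties broken alphabetically) is my own choice, since the question does not specify one.

```python
import collections


def format_chunk(number, sentence, counter):
    """Render one chunk as 'number, sentence, (word: count, ...)'."""
    # Sort by descending count, then alphabetically, so the output is stable.
    pairs = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    counts = ', '.join('{0}: {1}'.format(word, n) for word, n in pairs)
    return '{0}, {1}, ({2})'.format(number, sentence, counts)


example = collections.Counter(
    {'word1': 4, 'word2': 4, 'word3': 2, 'word4': 1})
print(format_chunk('1', 'SENTENCE', example))
# prints: 1, SENTENCE, (word1: 4, word2: 4, word3: 2, word4: 1)
```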

I edited your code a little; I think it does what you want:

def GetFrequencies(file):
    file_contents = open(file).readlines()  # file as list
    # use zip to get the entire file as a list of 5-line chunk tuples
    five_line_increments = zip(*[iter(file_contents)]*5)
    for chunk in five_line_increments:  # for each 5-line chunk...
        word_freq = {}  # word frequencies for this chunk
        for sentence in chunk:  # ...and for each line in that chunk
            # get the ITEMS column at index 3 and split it into a list of words
            words = sentence.split('\t')[3].strip().split(', ')
            for word in words:
                if word not in word_freq:
                    word_freq[word] = 1
                else:
                    word_freq[word] += 1
        print word_freq

Output:

{'word4': 1, 'word1': 4, 'word3': 2, 'word2': 4}
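
One thing to keep in mind with the zip(*[iter(...)]*5) idiom is that zip stops at the shortest input, so a trailing chunk with fewer than 5 lines is silently dropped. If n-line chunks of arbitrary size are needed, something along these lines might work. It is only a sketch: chunks_of and count_items are illustrative names, blank lines are assumed to be skippable, and every remaining line is assumed to have an ITEMS column at index 3.

```python
import collections
import itertools


def chunks_of(lines, n):
    """Yield lists of up to n non-blank lines; the last list may be shorter."""
    it = (l for l in lines if l.strip())
    while True:
        chunk = list(itertools.islice(it, n))
        if not chunk:
            return
        yield chunk


def count_items(chunk):
    """Count the comma-separated words in the fourth tab-delimited column."""
    counter = collections.Counter()
    for line in chunk:
        items = line.rstrip('\n').split('\t')[3]
        counter.update(w.strip() for w in items.split(',') if w.strip())
    return counter
```

With an open file object fd, the per-chunk counts would then come from something like count_items(chunk) for each chunk in chunks_of(fd, 5).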

To summarize: You want to append all "words" to a collection if they are not "DESCRIPTION" or "SENTENCE"? Try this:

for word in words_no_ws:
    if word not in ("DESCRIPTION", "SENTENCE"):
        collection.append(word)
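
Note that itertools.groupby only merges adjacent equal items, so feeding it an unsorted collection would report the same word several times with partial counts. Sorting first, or using collections.Counter, avoids that. A small illustration with made-up sample data:

```python
import collections
import itertools

# Made-up sample data standing in for the words of one chunk.
collection = ['word1', 'word2', 'word3', 'word1', 'word2', 'word4']

# groupby only groups adjacent equal items, so sort before grouping.
grouped = dict(
    (key, len(list(group)))
    for key, group in itertools.groupby(sorted(collection))
)

# Counter needs no sorting and gives the same totals.
counted = collections.Counter(collection)
```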
