Extract items from n-line chunks in a file, count frequency of items for each chunk, Python
I have a text file containing 5-line chunks of tab-delimited lines:
1 \t DESCRIPTION \t SENTENCE \t ITEMS
1 \t DESCRIPTION \t SENTENCE \t ITEMS
1 \t DESCRIPTION \t SENTENCE \t ITEMS
1 \t DESCRIPTION \t SENTENCE \t ITEMS
1 \t DESCRIPTION \t SENTENCE \t ITEMS
2 \t DESCRIPTION \t SENTENCE \t ITEMS
2 \t DESCRIPTION \t SENTENCE \t ITEMS
2 \t DESCRIPTION \t SENTENCE \t ITEMS
2 \t DESCRIPTION \t SENTENCE \t ITEMS
2 \t DESCRIPTION \t SENTENCE \t ITEMS
etc.
In each chunk, the DESCRIPTION and SENTENCE columns are the same. The data of interest is in the ITEMS column, which differs from line to line within the chunk and has the following format:
word1, word2, word3
...and so on
For each 5-line chunk, I need to count the frequency of word1, word2, etc. in ITEMS. For example, if the first 5-line chunk were as follows:
1 \t DESCRIPTION \t SENTENCE \t word1, word2, word3
1 \t DESCRIPTION \t SENTENCE \t word1, word2
1 \t DESCRIPTION \t SENTENCE \t word4
1 \t DESCRIPTION \t SENTENCE \t word1, word2, word3
1 \t DESCRIPTION \t SENTENCE \t word1, word2
then the correct output for this 5-line chunk would be:
1, SENTENCE, (word1: 4, word2: 4, word3: 2, word4: 1)
I.e., the chunk number, followed by the sentence, followed by the frequency counts for the words.
I have some code to extract the five-line chunks and to count the frequency of words in a chunk once it's extracted, but am stuck on the task of isolating each chunk, getting the word frequencies, moving on to the next, etc.
from itertools import groupby

def GetFrequencies(file):
    file_contents = open(file).readlines() #file as list
    """use zip to get the entire file as list of 5-line chunk tuples"""
    five_line_increments = zip(*[iter(file_contents)]*5)
    for chunk in five_line_increments: #for each 5-line chunk...
        for sentence in chunk: #...and for each sentence in that chunk
            words = sentence.split('\t')[3].split() #get the ITEMS column at index 3
            words_no_comma = [x.strip(',') for x in words] #get rid of the commas
            words_no_ws = [x.strip(' ') for x in words_no_comma] #get rid of the whitespace resulting from the removed commas
        """STUCK HERE The idea originally was to take the words lists for
        each chunk and combine them to create a big list, 'collection,' and
        feed this into the for-loop below."""
        for key, group in groupby(collection): #collection is a big list containing all of the words in the ITEMS section of the chunk, e.g., ['word1', 'word2', 'word3', 'word1', 'word1', 'word2', etc.]
            print key,len(list(group)),

I'm using Python 2.7.
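#!/usr/bin/env python
import collections

chunks = {}  # chunk number -> [sentence, item, item, ...]
with open('input') as fd:
    for line in fd:
        # split on tabs so a SENTENCE containing spaces stays in one field
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        key = fields[0].strip()
        if key not in chunks:
            chunks[key] = [fields[2].strip()]  # first entry holds the sentence
        # add the items from every line, including the first line of the chunk
        chunks[key].extend(w.strip() for w in fields[3].split(','))

for k, v in chunks.iteritems():
    counter = collections.Counter(v[1:])
    print k, v[0], counter

Outputs (for the example chunk in the question):
1 SENTENCE Counter({'word1': 4, 'word2': 4, 'word3': 2, 'word4': 1})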
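If the goal is the exact output format from the question, a variation along these lines might help. It is only a sketch: `count_items_by_chunk` and `format_chunk` are names made up for this example, it assumes every data line has at least four tab-separated fields, and it is written so that it should run on Python 3 as well as 2.7. An `OrderedDict` keeps the chunks in file order, which a plain dict does not guarantee on 2.7.

```python
from collections import Counter, OrderedDict

def count_items_by_chunk(lines):
    """Map chunk number -> (sentence, Counter of ITEMS words), in file order."""
    chunks = OrderedDict()
    for line in lines:
        fields = [f.strip() for f in line.split('\t')]
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        number, _description, sentence, items = fields[:4]
        if number not in chunks:
            chunks[number] = (sentence, Counter())
        # count every comma-separated word, ignoring empty leftovers
        chunks[number][1].update(w.strip() for w in items.split(',') if w.strip())
    return chunks

def format_chunk(number, sentence, counts):
    """Render one chunk as '1, SENTENCE, (word1: 4, ...)', words sorted by name."""
    pairs = ', '.join('%s: %d' % (w, counts[w]) for w in sorted(counts))
    return '%s, %s, (%s)' % (number, sentence, pairs)

# the example chunk from the question
sample = [
    "1 \t DESCRIPTION \t SENTENCE \t word1, word2, word3\n",
    "1 \t DESCRIPTION \t SENTENCE \t word1, word2\n",
    "1 \t DESCRIPTION \t SENTENCE \t word4\n",
    "1 \t DESCRIPTION \t SENTENCE \t word1, word2, word3\n",
    "1 \t DESCRIPTION \t SENTENCE \t word1, word2\n",
]
for number, (sentence, counts) in count_items_by_chunk(sample).items():
    print(format_chunk(number, sentence, counts))
# prints: 1, SENTENCE, (word1: 4, word2: 4, word3: 2, word4: 1)
```

Grouping on the chunk number rather than on a fixed count of five lines means a chunk with a missing or extra line is still counted under the right number.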
There's a csv parser in the standard library that can handle the input splitting for you:

import csv
import collections

def GetFrequencies(file_in):
    sentences = dict()
    # csv.reader is not a context manager, so open the file in the with-statement
    with open(file_in, 'rb') as f:
        csv_file = csv.reader(f, delimiter='\t')
        for line in csv_file:
            sentence = line[0]  # the chunk number in the first column
            if sentence not in sentences:
                sentences[sentence] = collections.Counter()
            sentences[sentence].update([x.strip(' ') for x in line[3].split(',')])
    return sentences
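On Python 3 the `csv` module expects a file opened in text mode (with `newline=''`) rather than `'rb'`. Here is a minimal sketch of the same idea for Python 3. The name `get_frequencies` is hypothetical, `io.StringIO` stands in for a real file so the example is self-contained, and `quoting=csv.QUOTE_NONE` reflects an assumption that the fields are never quoted.

```python
import csv
import io
from collections import Counter

def get_frequencies(tsv_file):
    """Return {chunk number: Counter of ITEMS words} for an open text-mode file."""
    frequencies = {}
    # assumption: fields are never quoted, so quote characters are left alone
    reader = csv.reader(tsv_file, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 4:
            continue  # skip blank or malformed rows
        chunk = row[0].strip()
        words = [w.strip() for w in row[3].split(',') if w.strip()]
        frequencies.setdefault(chunk, Counter()).update(words)
    return frequencies

# in-memory stand-in for the input file (made-up sample data)
sample = io.StringIO(
    "1\tDESCRIPTION\tSENTENCE\tword1, word2, word3\n"
    "1\tDESCRIPTION\tSENTENCE\tword1, word2\n"
    "1\tDESCRIPTION\tSENTENCE\tword4\n"
    "1\tDESCRIPTION\tSENTENCE\tword1, word2, word3\n"
    "1\tDESCRIPTION\tSENTENCE\tword1, word2\n"
    "2\tDESCRIPTION\tOTHER SENTENCE\tword5\n"
)
freqs = get_frequencies(sample)
print(sorted(freqs['1'].items()))
# prints: [('word1', 4), ('word2', 4), ('word3', 2), ('word4', 1)]
```

With a real file the call would look like `with open(path, newline='') as f: freqs = get_frequencies(f)`.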
I edited your code a little; I think it now does what you want:

def GetFrequencies(file):
    file_contents = open(file).readlines() #file as list
    #use zip to get the entire file as list of 5-line chunk tuples
    five_line_increments = zip(*[iter(file_contents)]*5)
    for chunk in five_line_increments: #for each 5-line chunk...
        word_freq = {} #word frequencies for each chunk
        for sentence in chunk: #...and for each sentence in that chunk
            items = sentence.split('\t')[3].strip('\n') #get the ITEMS column at index 3
            words = [w.strip() for w in items.split(',')] #split on commas and trim whitespace
            for word in words:
                if word not in word_freq:
                    word_freq[word] = 1
                else:
                    word_freq[word] += 1
        print word_freq #one dictionary per chunk

Output (for the example chunk):
{'word4': 1, 'word1': 4, 'word3': 2, 'word2': 4}
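In case the `zip(*[iter(...)]*5)` line looks like magic: the list holds the same iterator object five times, so each tuple that `zip` builds pulls five consecutive lines from it. The sketch below shows this on a toy list with a chunk size of 3. `fixed_size_chunks` is a name invented for the illustration. One thing worth knowing is that `zip` stops at the shortest input, so an incomplete chunk at the end of the file is silently dropped.

```python
def fixed_size_chunks(items, size):
    """Group items into tuples of `size`; a short trailing group is dropped by zip."""
    iterator = iter(items)
    # the same iterator appears `size` times, so zip consumes `size` items per tuple
    return list(zip(*[iterator] * size))

lines = ['a1', 'a2', 'a3', 'b1', 'b2', 'b3', 'c1']
print(fixed_size_chunks(lines, 3))
# prints: [('a1', 'a2', 'a3'), ('b1', 'b2', 'b3')]
```

Note that `'c1'` never shows up in the result. If the file might contain a chunk with fewer than five lines, grouping on the chunk number in the first column is probably the safer route.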
To summarize: You want to append all "words" to a collection if they are not "DESCRIPTION" or "SENTENCE"? Try this:
for word in words_no_ws:
    if word not in ("DESCRIPTION", "SENTENCE"):
        collection.append(word)
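One caveat once `collection` is built: `itertools.groupby` only merges adjacent equal items, so the `groupby(collection)` loop from the question will report the same word several times unless the list is sorted first. Below is a small sketch with made-up data, with `collections.Counter` shown as an alternative that does not need sorting.

```python
from collections import Counter
from itertools import groupby

# made-up word list in the order the words might be collected from a chunk
collection = ['word1', 'word2', 'word3', 'word1', 'word2', 'word4']

# without sorting, every run of equal neighbours becomes its own group
unsorted_runs = [(key, len(list(group))) for key, group in groupby(collection)]
# after sorting, equal words are adjacent and each word is counted once
sorted_runs = [(key, len(list(group))) for key, group in groupby(sorted(collection))]

print(unsorted_runs)
# prints: [('word1', 1), ('word2', 1), ('word3', 1), ('word1', 1), ('word2', 1), ('word4', 1)]
print(sorted_runs)
# prints: [('word1', 2), ('word2', 2), ('word3', 1), ('word4', 1)]

# Counter gives the same totals regardless of order
print(sorted(Counter(collection).items()) == sorted_runs)
# prints: True
```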