How to sum up the word count for each person in a dialogue?

Posted 2023-04-04 21:42

I'm starting to learn Python and I'm trying to write a program that would import a text file, count the total number of words, count the number of words in a specific paragraph (said by each participant, described by 'P1', 'P2' etc.), exclude these words (i.e. 'P1' etc.) from my word count, and print paragraphs separately.

Thanks to @James Hurford I got this code:

words = None
with open('data.txt') as f:
   words = f.read().split()
total_words = len(words)
print 'Total words:', total_words

in_para = False
para_type = None
paragraph = list()
for word in words:
  if ('P1' in word or
      'P2' in word or
      'P3' in word ):
      if in_para == False:
         in_para = True
         para_type = word
      else:
         print 'Words in paragraph', para_type, ':', len(paragraph)
         print ' '.join(paragraph)
         del paragraph[:]
         para_type = word
  else:
    paragraph.append(word)
else:
  if in_para == True:
    print 'Words in last paragraph', para_type, ':', len(paragraph)
    print ' '.join(paragraph)
  else:
    print 'No words'

My text file looks like this:

P1: Bla bla bla.

P2: Bla bla bla bla.

P1: Bla bla.

P3: Bla.

The next part I need to do is summing up the words for each participant. I can only print them, but I don't know how to return/reuse them.

I would need a new variable with word count for each participant that I could manipulate later on, in addition to summing up all the words said by each participant, e.g.

P1all = sum of words in paragraph

Is there a way to count "you're" or "it's" etc. as two words?

Any ideas how to solve it?

I would need a new variable with word count for each participant that I could manipulate later on

No, you would need a Counter (Python 2.7+, else use a defaultdict(int)) mapping persons to word counts.

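from collections import Counter
#from collections import defaultdict

words_per_person = Counter()
#words_per_person = defaultdict(int)

with open('data.txt') as inputfile:
    for ln in inputfile:
        if ':' not in ln:
            continue  # skip blank lines and lines without a speaker label
        person, text = ln.split(':', 1)
        words_per_person[person.strip()] += len(text.split())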

Now words_per_person['P1'] contains the number of words of P1, assuming text.split() is a good enough tokenizer for your purposes. (Linguists disagree about the definition of word, so you're always going to get an approximation.)
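On the question of counting "you're" or "it's" as two words: one possible approach, sketched below, is to swap text.split() for a regular expression that treats the apostrophe (and any other punctuation) as a separator. The names WORD_RE, count_words and words_per_person are made up for this illustration and the sample lines are invented; whether this tokenization suits your data is a judgment call.

```python
import re
from collections import Counter

# Runs of letters/digits count as one word; apostrophes and other
# punctuation act as separators, so "you're" yields "you" and "re".
WORD_RE = re.compile(r"[^\W_]+", re.UNICODE)


def count_words(text):
    """Return the number of word tokens in text (hypothetical helper)."""
    return len(WORD_RE.findall(text))


def words_per_person(lines):
    """Map each speaker label to that speaker's total word count."""
    totals = Counter()
    for ln in lines:
        if ':' not in ln:
            continue  # blank line or line without a speaker label
        person, text = ln.split(':', 1)
        totals[person.strip()] += count_words(text)
    return totals


# Invented sample data in the same "P1: ..." format as the question.
sample = ["P1: You're right, it's fine.", "", "P2: Bla bla.", "P1: Bla."]
totals = words_per_person(sample)
```

Here totals['P1'] plays the role of the P1all variable from the question, and because punctuation is never part of a token, the speaker labels and trailing full stops do not inflate the counts.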

Congrats on beginning your adventure with Python! Not everything in this post might make sense right now, but bookmark it and come back to it if it seems helpful later. Eventually you should try to move from scripting to software engineering, and here are a few ideas for you!

With great power comes great responsibility: Python doesn't hold your hand or enforce "good" design, so as a Python developer you need to be more disciplined than you would be in languages that do.

I find it helps to start with a top-down design.

def main():
    text = get_text()
    p_text = process_text(text)
    catalogue = process_catalogue(p_text)

BOOM! You just wrote the whole program -- now you just need to go back and fill in the blanks! When you do it like this, it seems less intimidating. Personally, I don't consider myself smart enough to solve very big problems, but I'm a pro at solving small problems. So let's tackle one thing at a time. I'm going to start with 'process_text'.

def process_text(text):
    b_text = bundle_dialogue_items(text)
    f_text = filter_dialogue_items(b_text)
    c_text = clean_dialogue_items(f_text)
    return c_text

I'm not really sure what those things mean yet, but I know that text problems tend to follow a pattern called "map/reduce", which means you perform an operation on each item and then you clean up and combine the results, so I put in some placeholder functions. I might go back and add more if necessary.
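As a minimal sketch of that map/reduce idea, separate from the placeholder functions above and using invented sample lines: the "map" step turns each line into a word count, and the "reduce" step combines those counts into one total.

```python
from functools import reduce

# Invented sample lines, with speaker labels already removed.
lines = ["Bla bla bla.", "Bla bla bla bla.", "Bla bla."]

# Map: turn every line into its word count.
counts = list(map(lambda line: len(line.split()), lines))

# Reduce: combine the per-line counts into a single total.
total = reduce(lambda a, b: a + b, counts, 0)
```

In everyday Python you would more likely write sum(len(line.split()) for line in lines), which is the same pattern with less ceremony.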

Now let's write 'process_catalogue'. I could've written "process_dict" but that sounded lame to me.

def process_catalogue(c_text):
    speakers = make_catalogue(c_text)
    s_speakers = sum_words_per_paragraph_items(speakers)
    t_speakers = total_word_count(s_speakers)
    return t_speakers

Cool. Not too bad. You might approach this differently than I did, but I thought it would make sense to aggregate the items, then count the words per paragraph, and then count all the words.

So, at this point I'd probably make one or two little 'lib' (library) modules to back-fill the remaining functions. For the sake of you being able to run this without worrying about imports, I'm going to stick it all in one .py file, but eventually you'll learn how to break these up so it looks nicer. So let's do this.

# ------------------ #
# == process_text == #
# ------------------ #

def bundle_dialogue_items(lines):
    cur_speaker = None
    paragraphs = Counter()
    for line in lines:
        if re.match(p, line):
            # maxsplit=1 keeps any later colons inside the dialogue text
            cur_speaker, dialogue = line.split(':', 1)
            paragraphs[cur_speaker] += 1
        else:
            dialogue = line

        res = cur_speaker, dialogue, paragraphs[cur_speaker]
        yield res


def filter_dialogue_items(lines):
    for name, dialogue, paragraph in lines:
        if dialogue:
            res = name, dialogue, paragraph
            yield res

def clean_dialogue_items(flines):
    for name, dialogue, paragraph in flines:
        s_dialogue = dialogue.strip().split()
        c_dialogue = [clean_word(w) for w in s_dialogue]
        res = name, c_dialogue, paragraph
        yield res

aaaand a little helper function

# ------------------- #
# == aux functions == #
# ------------------- #

to_clean = string.whitespace + string.punctuation
def clean_word(word):
    res = ''.join(c for c in word if c not in to_clean)
    return res

It may not be obvious, but this library is designed as a data processing pipeline. There are several ways to process data: one is pipeline processing and another is batch processing. Let's take a look at batch processing.
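To make the difference concrete, here is a small sketch with two throwaway generator stages (numbered and non_empty are hypothetical names, not part of the library above). In the pipeline form each item flows through all stages before the next item is touched; in the batch form every stage finishes a complete list first.

```python
def numbered(lines):
    """Stage 1: yield (line_number, line) pairs one at a time."""
    for i, line in enumerate(lines, 1):
        yield i, line


def non_empty(items):
    """Stage 2: pass through only the pairs whose line has content."""
    for i, line in items:
        if line.strip():
            yield i, line


lines = ["P1: Bla bla.", "", "P2: Bla."]

# Pipeline: nothing is computed until the result is consumed.
pipeline = non_empty(numbered(lines))
first = next(pipeline)  # only the first line has been processed so far

# Batch: each step builds a complete list before the next step starts.
batch_numbered = list(numbered(lines))
batch_result = list(non_empty(batch_numbered))
```

The generator functions in process_text work the pipeline way, while the dictionary-building functions below need the whole input before they can return, which is what makes them batch steps.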

# ----------------------- #
# == process_catalogue == #
# ----------------------- #

speaker_stats = 'stats'
def make_catalogue(names_with_dialogue):
    speakers = {}
    for name, dialogue, paragraph in names_with_dialogue:
        speaker = speakers.setdefault(name, {})
        stats = speaker.setdefault(speaker_stats, {})
        stats.setdefault(paragraph, []).extend(dialogue)
    return speakers



word_count = 'word_count'
def sum_words_per_paragraph_items(speakers):
    for speaker in speakers:
        word_stats = speakers[speaker][speaker_stats]
        speakers[speaker][word_count] = Counter()
        for paragraph in word_stats:
            speakers[speaker][word_count][paragraph] += len(word_stats[paragraph])
    return speakers


total = 'total'
def total_word_count(speakers):
    for speaker in speakers:
        wc = speakers[speaker][word_count]
        speakers[speaker][total] = 0
        for c in wc:
            speakers[speaker][total] += wc[c]
    return speakers

All these nested dictionaries are getting a little complicated. In actual production code I would replace these with some more readable classes (along with adding tests and docstrings!!), but I don't want to make this more confusing than it already is! Alright, for your convenience below is the whole thing put together.
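For the curious, this is roughly what I mean by a more readable class. It is only a sketch: the Speaker class and its method names are invented here and are not used by the listing that follows.

```python
class Speaker(object):
    """Hypothetical replacement for the per-speaker nested dict."""

    def __init__(self, name):
        self.name = name
        self.paragraphs = {}  # paragraph number -> list of words

    def add_words(self, paragraph, words):
        """Append words to the given paragraph, creating it if needed."""
        self.paragraphs.setdefault(paragraph, []).extend(words)

    def word_count(self, paragraph):
        """Number of words recorded for one paragraph."""
        return len(self.paragraphs.get(paragraph, []))

    @property
    def total(self):
        """Number of words across all paragraphs."""
        return sum(len(words) for words in self.paragraphs.values())


bob = Speaker('BOB')
bob.add_words(1, ['blah', 'blah'])
bob.add_words(1, ['hello'])
bob.add_words(2, ['boopy', 'doopy', 'doop'])
```

With this, bob.total and bob.word_count(1) replace lookups like speakers['BOB']['total'] and speakers['BOB']['word_count'][1].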

import pprint
import re
import string
from collections import Counter

p = re.compile(r'(\w+?):')


def get_text_line_items(text):
    for line in text.split('\n'):
        yield line


def bundle_dialogue_items(lines):
    cur_speaker = None
    paragraphs = Counter()
    for line in lines:
        if re.match(p, line):
            # maxsplit=1 keeps any later colons inside the dialogue text
            cur_speaker, dialogue = line.split(':', 1)
            paragraphs[cur_speaker] += 1
        else:
            dialogue = line

        res = cur_speaker, dialogue, paragraphs[cur_speaker]
        yield res


def filter_dialogue_items(lines):
    for name, dialogue, paragraph in lines:
        if dialogue:
            res = name, dialogue, paragraph
            yield res


to_clean = string.whitespace + string.punctuation


def clean_word(word):
    res = ''.join(c for c in word if c not in to_clean)
    return res


def clean_dialogue_items(flines):
    for name, dialogue, paragraph in flines:
        s_dialogue = dialogue.strip().split()
        c_dialogue = [clean_word(w) for w in s_dialogue]
        res = name, c_dialogue, paragraph
        yield res


speaker_stats = 'stats'


def make_catalogue(names_with_dialogue):
    speakers = {}
    for name, dialogue, paragraph in names_with_dialogue:
        speaker = speakers.setdefault(name, {})
        stats = speaker.setdefault(speaker_stats, {})
        stats.setdefault(paragraph, []).extend(dialogue)
    return speakers


def clean_dict(speakers):
    for speaker in speakers:
        stats = speakers[speaker][speaker_stats]
        for paragraph in stats:
            stats[paragraph] = [''.join(c for c in word if c not in to_clean)
                                for word in stats[paragraph]]
    return speakers


word_count = 'word_count'


def sum_words_per_paragraph_items(speakers):
    for speaker in speakers:
        word_stats = speakers[speaker][speaker_stats]
        speakers[speaker][word_count] = Counter()
        for paragraph in word_stats:
            speakers[speaker][word_count][paragraph] += len(word_stats[paragraph])
    return speakers


total = 'total'


def total_word_count(speakers):
    for speaker in speakers:
        wc = speakers[speaker][word_count]
        speakers[speaker][total] = 0
        for c in wc:
            speakers[speaker][total] += wc[c]
    return speakers


def get_text():
    text = '''BOB: blah blah blah blah
blah hello goodbye etc.

JERRY:.............................................
...............

BOB:blah blah blah
blah blah blah
blah.
BOB: boopy doopy doop
P1: Bla bla bla.
P2: Bla bla bla bla.
P1: Bla bla.
P3: Bla.'''
    text = get_text_line_items(text)
    return text


def process_catalogue(c_text):
    speakers = make_catalogue(c_text)
    s_speakers = sum_words_per_paragraph_items(speakers)
    t_speakers = total_word_count(s_speakers)
    return t_speakers


def process_text(text):
    b_text = bundle_dialogue_items(text)
    f_text = filter_dialogue_items(b_text)
    c_text = clean_dialogue_items(f_text)
    return c_text


def main():

    text = get_text()
    c_text = process_text(text)
    t_speakers = process_catalogue(c_text)

    # take a look at your hard work!
    pprint.pprint(t_speakers)


if __name__ == '__main__':
    main()

So this script is almost certainly overkill for this application, but the point is to see what (questionably) readable, maintainable, modular Python code might look like.

Pretty sure output looks something like:

{'BOB': {'stats': {1: ['blah',
                       'blah',
                       'blah',
                       'blah',
                       'blah',
                       'hello',
                       'goodbye',
                       'etc'],
                   2: ['blah',
                       'blah',
                       'blah',
                       'blah',
                       'blah',
                       'blah',
                       'blah'],
                   3: ['boopy', 'doopy', 'doop']},
         'total': 18,
         'word_count': Counter({1: 8, 2: 7, 3: 3})},
 'JERRY': {'stats': {1: ['', '']}, 'total': 2, 'word_count': Counter({1: 2})},
 'P1': {'stats': {1: ['Bla', 'bla', 'bla'], 2: ['Bla', 'bla']},
        'total': 5,
        'word_count': Counter({1: 3, 2: 2})},
 'P2': {'stats': {1: ['Bla', 'bla', 'bla', 'bla']},
        'total': 4,
        'word_count': Counter({1: 4})},
 'P3': {'stats': {1: ['Bla']}, 'total': 1, 'word_count': Counter({1: 1})}}
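Since the question asked for per-participant counts that can be reused later, here is one way to pull them out of that structure. The dictionary below is a trimmed, hand-copied subset of the output above (the 'word_count' entries are left out), and the names totals, p1_all and grand_total are only illustrative.

```python
# A trimmed copy of the structure printed above.
t_speakers = {
    'P1': {'stats': {1: ['Bla', 'bla', 'bla'], 2: ['Bla', 'bla']},
           'total': 5},
    'P2': {'stats': {1: ['Bla', 'bla', 'bla', 'bla']}, 'total': 4},
    'P3': {'stats': {1: ['Bla']}, 'total': 1},
}

# Flatten to a simple {speaker: total} mapping for later use.
totals = {name: data['total'] for name, data in t_speakers.items()}

p1_all = totals['P1']               # words said by P1 across all paragraphs
grand_total = sum(totals.values())  # words said by everyone in this subset
```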

You can do this with two variables: one to keep track of which person is speaking, the other to hold the paragraphs for each person. To store the paragraphs and record who each one belongs to, use a dict with the person as the key and a list of the paragraphs that person said as the value.

para_dict = dict()
para_type = None

for word in words:
    if ('P1' in word or
        'P2' in word or
        'P3' in word ):
        #extract the part we want leaving off the ':'
        para_type = word[:2]
        #create a dict with a list of lists 
        #to contain each paragraph the person uses
        if para_type not in para_dict:
            para_dict[para_type] = list()
        para_dict[para_type].append(list())
    else:
        #Append the word to the last list in the list of lists
        para_dict[para_type][-1].append(word)

From here you can sum up the number of words spoken by each person:

for person, para_list in para_dict.items():
    counts_list = list()
    for para in para_list:
        counts_list.append(len(para))
    print person, 'spoke', sum(counts_list), 'words'
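If you want to keep the totals rather than only print them, the same loop can be wrapped in a function that returns a dict. This is a sketch: count_words_per_person is a name I made up, and the para_dict below is filled in by hand to mirror what the loop above would build from the sample file.

```python
def count_words_per_person(para_dict):
    """Return {person: total words} instead of printing (hypothetical helper)."""
    return {person: sum(len(para) for para in para_list)
            for person, para_list in para_dict.items()}


# Hand-built stand-in for the para_dict produced by the loop above.
para_dict = {
    'P1': [['Bla', 'bla', 'bla.'], ['Bla', 'bla.']],
    'P2': [['Bla', 'bla', 'bla', 'bla.']],
    'P3': [['Bla.']],
}

totals = count_words_per_person(para_dict)
P1all = totals['P1']  # the reusable per-participant total from the question
```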
