Read .txt file and analyze

2023-01-28 18:31 Asked by:

I'm working on Huffman coding of an arbitrary .txt file, so first I need to analyse the text file. I need to read it and then analyse it. The output I need is a table like this:

letter | frequency (how many times the same letter is repeated) | Huffman code (this will come later)

I started with:

 f = open('test.txt', 'r')    # open test.txt
 for lines in f:
     print(lines)             # check that reading works

How can I process the characters read from the file in alphabetical order? This is the loop I have:

with open("test.txt") as f_in:
    for line in f_in:
        for char in line:
            frequencies[char] += 1

Many thanks.

Well, I tried it like this:
frequencies = collections.defaultdict(int)
with open("test.txt") as f_in:
    for line in f_in:
        for char in line:
            frequencies[char] += 1


 frequencies = [(count, char) for char, count in frequencies.iteritems()]
 frequencies.sort(key=operator.itemgetter(1))

But the interpreter returns an error.

I need the alphabetical order inside the for loop, not applied to frequencies at the end.

To get your table of frequencies, I would use a defaultdict. This will only iterate over the data once.

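import collections
import operator

filename = 'test.txt'  # the file from the question

# Count each character in a single pass over the file.
frequencies = collections.defaultdict(int)
with open(filename) as f_in:
    for line in f_in:
        for char in line:
            frequencies[char] += 1

# Turn the dict into (count, char) pairs and sort by character (index 1).
# items() works in Python 2 and 3; iteritems() exists only in Python 2.
frequencies = [(count, char) for char, count in frequencies.items()]
frequencies.sort(key=operator.itemgetter(1))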
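The question asks for the alphabetical order inside the loop itself rather than as a separate sort at the end. One possible way to do that is to iterate over sorted(frequencies), which walks the dictionary keys in sorted order. This is only a sketch: the helper names count_chars and format_table are made up for illustration, and the sample string stands in for the contents of test.txt.

```python
import collections


def count_chars(text):
    """Count every character of text in a single pass."""
    frequencies = collections.defaultdict(int)
    for char in text:
        frequencies[char] += 1
    return frequencies


def format_table(frequencies):
    """Return one 'char | count' row per character, in alphabetical order."""
    # sorted() on a dict yields its keys in sorted order, so the
    # alphabetical ordering happens in the loop itself.
    # %r shows whitespace such as newlines in a readable form.
    return ['%r | %d' % (char, frequencies[char])
            for char in sorted(frequencies)]


sample = 'banana\n'  # stand-in for the contents of test.txt
for row in format_table(count_chars(sample)):
    print(row)
```

Note that "alphabetical" here means code point order, so newlines and spaces come before letters and uppercase letters come before lowercase ones.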

with open('test.txt') as f: data = f.read()
table = dict((c, data.count(c)) for c in set(data))
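The one-liner above is compact, but data.count(c) rescans the whole text once per distinct character. If that matters for large files, a single-pass alternative could look like the sketch below, which uses collections.Counter. The sample string is a stand-in for the result of f.read().

```python
import collections

data = 'hello world'  # stand-in for the contents of test.txt

# Version from the answer: one full scan of data per distinct character.
table = dict((c, data.count(c)) for c in set(data))

# Counter tallies every character in a single pass over data.
counter_table = collections.Counter(data)

# Both approaches produce the same character counts.
print(dict(counter_table) == table)
```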

I made this solution using a collections.Counter():

import re
import collections


if __name__ == '__main__':
    is_letter = re.compile('[A-Za-z]')

    frequencies = collections.Counter()
    with open(r'text.txt') as f_in:
        for line in f_in:
            for char in line:
                if is_letter.match(char):
                    frequencies[char.lower()] += 1

    # Print the letters in alphabetical order; sorted() on a Counter sorts its keys.
    for c in sorted(frequencies):
        print('%s | %d' % (c, frequencies[c]))

The regular expression is_letter is used to filter for only the characters we are interested in. It gives output that looks like this.

a | 177
b | 29
c | 7
d | 167
e | 374
f | 58
g | 100
h | 44
i | 135
j | 21
k | 64
l | 125
m | 85
n | 191
o | 105
p | 34
r | 185
s | 130
t | 146
u | 34
v | 68
x | 1
y | 14
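For the third column of the table, the Huffman code that the question says will come later, a frequency table like the one above could be turned into codes roughly as follows. This is a minimal sketch and not part of any of the answers: the function name huffman_codes and the sample counts are made up for illustration, and the exact bit strings depend on how ties between equal counts are broken.

```python
import heapq
import itertools


def huffman_codes(frequencies):
    """Map each symbol to a bit string derived from its frequency."""
    tiebreak = itertools.count()  # unique number so tuples never compare dicts
    # Each heap entry: (total count, tiebreak, {symbol: code so far})
    heap = [(count, next(tiebreak), {symbol: ''})
            for symbol, count in frequencies.items()]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:
        # A single symbol still needs a one-bit code.
        return {symbol: '0' for symbol in heap[0][2]}
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes.
        count_a, _, codes_a = heapq.heappop(heap)
        count_b, _, codes_b = heapq.heappop(heap)
        merged = {s: '0' + c for s, c in codes_a.items()}
        merged.update({s: '1' + c for s, c in codes_b.items()})
        heapq.heappush(heap, (count_a + count_b, next(tiebreak), merged))
    return heap[0][2]


frequencies = {'a': 5, 'b': 2, 'c': 1, 'd': 1}  # made-up sample counts
codes = huffman_codes(frequencies)
for symbol in sorted(codes):
    print('%s | %d | %s' % (symbol, frequencies[symbol], codes[symbol]))
```

Frequent symbols end up with shorter codes, and no code is a prefix of another, which is what allows the encoded bit stream to be decoded unambiguously.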
