
Python - suggestions for filtering/sorting a large file?

I have a file containing ~20 million lines (~1.5GB). Each line is of the form:

entry_1 entry_2 entry_3 ......... entry_5

The file contains duplicates but of the form:

entry_2 entry_1 entry_3 ......... entry_5

The contents of some lines are identical but the first two elements are often (possibly always) switched.

Does anyone have any suggestions on how to remove duplicates of this nature from a file of this size?

Thanks.


A suitable solution would depend on what constraints you have and how often you need to run this operation.

If this is a one-time (or infrequent) operation and if memory usage is not a big issue, something like this would be sufficient:

visited = set()  # set gives O(1) membership lookups
with open(out_filename, "w") as outfile:
    with open(in_filename, "r") as infile:
        for line in infile:
            x = line.split()
            # sort the first two entries so swapped duplicates map to the same key
            k = (tuple(sorted(x[:2])), tuple(x[2:]))
            if k not in visited:
                outfile.write(line)
                visited.add(k)

The memory usage depends on the number of unique entries we need to keep track of in visited. If there aren't many duplicates, you'll end up with almost all of the data in memory.

If memory usage becomes an issue, you can do this in multiple stages:

  1. First pre-process the file by sorting the first two elements of each line.
  2. Sort the whole file by lines.
  3. Remove the duplicates, which is now straightforward since duplicate entries will appear next to each other.

Steps 2 and 3 can be merged, since you can simply discard duplicates when comparing entries while performing the sort.

If you don't mind using the shell, steps 2 and 3 can be done with sort -u yourfile.

Note that this changes the order of the lines in the file (which you mentioned is not an issue).
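As a rough sketch of step 1 (assuming the fields are whitespace-separated; the file names here are just placeholders):

# Step 1: normalise the first two entries of each line so that swapped
# duplicates become byte-identical lines. File names are placeholders.
with open("input.txt") as infile, open("normalised.txt", "w") as outfile:
    for line in infile:
        fields = line.split()
        fields[:2] = sorted(fields[:2])  # canonical order for the first two entries
        outfile.write(" ".join(fields) + "\n")

After that, sort -u normalised.txt > deduped.txt (or an external merge sort in Python) performs steps 2 and 3 in a single pass.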

To drastically reduce memory usage at the expense of some performance, you can use a file-based db to store and look up visited entries (in place of set()).

You can speed this up by keeping a set of hashes of the entries in memory, and only querying the db when a hash matches, to confirm whether the entries are indeed the same. The hash can be as simple as the first character of each entry, the built-in hash() function, or an existing hash algorithm. Each choice is a trade-off between performance, hash size, and collision frequency. A good choice depends on your data and your constraints.
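As an illustrative sketch of that combination (using the standard-library dbm module as the file-based store; the file names and the use of the built-in hash() are just one possible choice):

import dbm

# Keep only cheap integer hashes in memory; confirm candidates against an
# on-disk dbm store so the exact keys never have to fit in RAM.
seen_hashes = set()
with dbm.open("visited.db", "c") as db, \
        open("input.txt") as infile, \
        open("deduped.txt", "w") as outfile:
    for line in infile:
        fields = line.split()
        key = " ".join(sorted(fields[:2]) + fields[2:])
        h = hash(key)
        # only when the hash has been seen before do we pay for a disk lookup
        if h in seen_hashes and key in db:
            continue
        seen_hashes.add(h)
        db[key] = ""  # the value is irrelevant; only key membership matters
        outfile.write(line)

How much this helps depends on the duplicate rate: with few hash collisions the db is rarely read, and memory holds one integer per unique line instead of the full key.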

This will take some effort to find an optimal solution, and is only worth embarking on if you need to perform this operation often.


separator = " "
aSet = set()
with open('myfile', 'rU') as infile:
    for line in infile:
        tempLine = line.split(separator)
        tempLine = tempLine[1:2] + tempLine[0:1] + tempLine[2:]
        tempLine = separator.join(tempLine)
        if line not in aSet and tempLine not in aSet:
            aSet.add(line)

Now, aSet contains the set of lines that are unique regardless of whether entry_1 and entry_2 have been swapped.

EDIT: If lines should be treated as duplicates whenever they contain the same entries in any order (not just with the first two swapped), then:

separator = " "
aSet = set()
with open('myfile', 'rU') as infile:
    for line in infile:
        aSet.add(frozenset(line.split(separator)))