
Shuffle the records of a list of text files in one single file

I have a list of text files (file1.txt, file2.txt, file3.txt, ..., filen.txt) whose records I need to shuffle into one single big file *.

Requirements:

1. The records of a given file need to be reversed before being shuffled

2. The records of a given file should keep the reversed order in the destination file

3. I don't know how many files I need to shuffle, so the code should be as generic as possible (allowing the file names to be declared in a list, for example)

4. Files could have different sizes

Example:

File1.txt
---------
File1Record1
File1Record2
File1Record3
File1Record4

File2.txt
---------
File2Record1
File2Record2


File3.txt
---------
File3Record1
File3Record2
File3Record3
File3Record4
File3Record5

the output should be something like this:

ResultFile.txt
--------------
File3Record5   -|
File2Record2    |
File1Record4    |
File3Record4   -|
File2Record1    |
File1Record3    |-->File3 records are shuffled with the other records and 
File3Record3   -|   are correctly "reversed" and they kept the correct 
File1Record2    |   ordering
File3Record2   -|
File1Record1    |
File3Record1   -|

* I'm not crazy; I have to import these files (blog posts) using the resultfile.txt as input

EDIT:

The result can be in any order you want: completely or partially shuffled, or uniformly interleaved; it does not matter. What matters is that points 1 and 2 are both honoured.


What about this:

>>> import random
>>> l = [["1a","1b","1c","1d"], ["2a","2b"], ["3a","3b","3c","3d","3e"]]
>>> while l:
...     x = random.choice(l)
...     print(x.pop())  # pop() takes the last record, so each list comes out reversed
...     if not x:
...         l.remove(x)
...
1d
1c
2b
3e
2a
3d
1b
3c
3b
3a
1a

You could optimize it in various ways, but that's the general idea. This also works if you cannot read the files into memory all at once and need to iterate over them instead. In that case

  • read a line from the file instead of popping from a list (starting from the end of the file, so that the records still come out reversed)
  • check for EOF instead of empty lists
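
A minimal sketch of that streaming variant, assuming UTF-8 text files; the helper names `reversed_lines` and `interleave` are made up for illustration. To honour requirement 1 without loading whole files, it keeps only the byte offset of each line in memory and then seeks backwards through the file. A generator running dry plays the role of EOF.

```python
import random

def reversed_lines(path):
    """Yield the lines of `path` from last to first, storing only byte offsets."""
    offsets = []
    with open(path, "rb") as f:
        pos = 0
        for line in f:               # first pass: remember where each line starts
            offsets.append(pos)
            pos += len(line)
        for start in reversed(offsets):
            f.seek(start)            # second pass: jump to each line, last one first
            yield f.readline().decode("utf-8")  # assumption: files are UTF-8

def interleave(paths, rng=random):
    """Yield every line of every file; each file's own lines come out reversed."""
    streams = [reversed_lines(p) for p in paths]
    while streams:
        stream = rng.choice(streams)
        try:
            yield next(stream)
        except StopIteration:        # this file is exhausted ("EOF")
            streams.remove(stream)
```

Writing the result is then a matter of `output.writelines(interleave(filenames))`. If a file's last line has no trailing newline, that record is emitted without one, so it may need to be added before writing.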


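You could try the following. In a first step, combine the reversed() lines of each file with itertools.zip_longest(); plain zip() would stop at the shortest file and drop the remaining records:

from itertools import zip_longest

zipped = zip_longest(reversed(lines1), reversed(lines2), reversed(lines3))

Then you can concatenate the items in zipped again:

lst = []
for triple in zipped:
    lst.extend(triple)

Finally, you have to remove all the Nones that zip_longest() added as padding for the shorter files:

lst = [line for line in lst if line is not None]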
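
A runnable sketch of this round-robin idea on the question's example data. It assumes `itertools.zip_longest` with a private sentinel as the fill value, so that shorter files are padded instead of cutting the result short; `round_robin_reversed` is a name chosen here for illustration.

```python
from itertools import zip_longest

_MISSING = object()  # padding marker that can never collide with a real record

def round_robin_reversed(*record_lists):
    """Take one record from the end of each list in turn until all are used up."""
    result = []
    for group in zip_longest(*map(reversed, record_lists), fillvalue=_MISSING):
        result.extend(rec for rec in group if rec is not _MISSING)
    return result

lines1 = ["File1Record%d" % i for i in range(1, 5)]
lines2 = ["File2Record%d" % i for i in range(1, 3)]
lines3 = ["File3Record%d" % i for i in range(1, 6)]

for record in round_robin_reversed(lines1, lines2, lines3):
    print(record)
```

The result is a uniform interleaving rather than a random one, which the question's EDIT explicitly allows.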


import random

filenames = [ 'filename0', ... , 'filenameN' ]

# Read every file into its own list of lines, closing each file afterwards.
lines = []
for fn in filenames:
    with open(fn, 'r') as f:
        lines.append(f.readlines())

with open('output', 'w') as output:
    while lines:
        l = random.choice(lines)
        if not l:
            lines.remove(l)        # this file is exhausted
        else:
            output.write(l.pop())  # pop() takes the last remaining line

One bit may seem magical here: the lines read from the files don't need reversing, because when we write them to the output file we use list.pop(), which takes items from the end of the list (here, the last remaining line of the file).
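
To see why no explicit reversal is needed, here is a tiny illustration with made-up records: draining a list with `pop()` produces exactly what `reversed()` would.

```python
records = ["Record1\n", "Record2\n", "Record3\n"]  # made-up file contents
expected = list(reversed(records))

drained = []
while records:
    drained.append(records.pop())  # removes and returns the last element

print(drained == expected)
```

Popping from the end is also cheap (constant time), whereas `pop(0)` would have to shift every remaining element.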


A simple solution might be to create a list of lists, and then pop a line off a random list until they're all exhausted:

>>> import random
>>> filerecords = [['File{0}Record{1}'.format(i, j) for j in range(5)] for i in range(5)]
>>> concatenation = []
>>> while any(filerecords):
...     selection = random.choice(filerecords)
...     if selection:
...         concatenation.append(selection.pop())
...     else:
...         filerecords.remove(selection)
... 
>>> concatenation
['File1Record4', 'File3Record4', 'File0Record4', 'File0Record3', 'File0Record2',
 'File4Record4', 'File0Record1', 'File3Record3', 'File4Record3', 'File0Record0',
 'File4Record2', 'File2Record4', 'File4Record1', 'File3Record2', 'File4Record0',
 'File2Record3', 'File1Record3', 'File2Record2', 'File2Record1', 'File3Record1',
 'File3Record0', 'File1Record2', 'File2Record0', 'File1Record1', 'File1Record0']
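
Because the result is random, it can help to check requirements 1 and 2 mechanically instead of by eye. The sketch below uses a hypothetical helper, `keeps_reversed_order`, and assumes that every record is unique across all files.

```python
def keeps_reversed_order(originals, merged):
    """Return True if `merged` holds every record once, each file's in reverse order."""
    all_records = [rec for original in originals for rec in original]
    if sorted(merged) != sorted(all_records):
        return False  # something was lost, duplicated or invented
    position = {rec: i for i, rec in enumerate(merged)}  # assumes unique records
    for original in originals:
        indexes = [position[rec] for rec in original]
        # reversed order means the positions strictly decrease along the original
        if any(a < b for a, b in zip(indexes, indexes[1:])):
            return False
    return True

originals = [["a1", "a2", "a3"], ["b1", "b2"]]
print(keeps_reversed_order(originals, ["a3", "b2", "a2", "b1", "a1"]))  # True
print(keeps_reversed_order(originals, ["a1", "b2", "a2", "b1", "a3"]))  # False
```

Keep a copy of the original lists before running the loop above, since `pop()` empties them.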


I strongly recommend investing some time to read Generator Tricks for Systems Programmers (PDF). It's from a presentation at PyCon 08, and it deals specifically with processing arbitrarily large log files. The reversal aspect is an interesting wrinkle, but the rest of the presentation should speak directly to your problem.


filelist = (
    'file1.txt',
    'file2.txt',
    'file3.txt',
)

all_records = []

max_records = 0
for f in filelist:
    with open(f, 'r') as fp:
        records = fp.readlines()
    max_records = max(max_records, len(records))
    records.reverse()  # requirement 1: reverse each file's records
    all_records.append(records)

all_records.reverse()

# Round-robin: write the i-th reversed record of every file that still has one.
with open('result.txt', 'w') as res_fp:
    for i in range(max_records):
        for records in all_records:
            if i < len(records):
                res_fp.write(records[i])


I'm not a python zen master, but here's my take.

import random

# You have to read everything into a list from at least one of the files.
fin = open("filename1","r").readlines()
# tuple of all of the other files.
fls = ( open("filename2","r"), 
       open("filename3","r"), )

for fl in fls: #iterate through tuple
   curr = 0
   for line in fl: #iterate through a file.
      # Skip ahead one position for each 1 that is randomly chosen (never past
      # the end), then insert. Every line is inserted, so none is lost.
      while curr < len(fin) and round(random.random()):
         curr += 1
      fin.insert(curr,line)
      curr += 1 # the next line of this file must land after this one.
   fl.close()

# when you're *done* reverse. It's easier.
fin.reverse()

Unfortunately, with this it becomes obvious that this is a weighted distribution. This can be fixed by calculating the length of each of the files and multiplying the call to random by a certain probability based on that. I'll see if I can provide that at some later point.
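
One way to remove the skew, sketched here under the assumption that all files fit in memory: pick the source file for each output line with probability proportional to the number of lines it still has left, which makes every order-preserving interleaving equally likely. `uniform_interleave` is a name invented for this sketch, and `random.choices` needs Python 3.6 or later.

```python
import random

def uniform_interleave(record_lists, rng=random):
    """Merge the lists, reversing each one, with every valid interleaving equally likely."""
    stacks = [list(recs) for recs in record_lists if recs]  # copy, skip empty files
    result = []
    while stacks:
        # A file with more remaining lines is proportionally more likely to be picked.
        weights = [len(stack) for stack in stacks]
        i = rng.choices(range(len(stacks)), weights=weights)[0]
        result.append(stacks[i].pop())  # pop() from the end reverses the file
        if not stacks[i]:
            del stacks[i]
    return result
```

Passing `rng=random.Random(seed)` (a hypothetical seed of your choosing) makes the shuffle reproducible.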


A possible merging function is available in the standard library. It's intended to merge sorted lists to make sorted combined lists; garbage in, garbage out, but it does have the desired property of maintaining sublist order no matter what.

import heapq

def merge_files(output, *inputs):
    # all parameters are opened files with appropriate modes.
    # reversed() needs a sequence, so each input file is read fully into a tuple first.
    for line in heapq.merge(*(reversed(tuple(f)) for f in inputs)):
        output.write(line)
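
A small demonstration of that property, using in-memory lists in place of open files. Note that `heapq.merge` always yields the smallest of the current head lines, so how much the files interleave depends on how their lines compare. With the question's example names, every File1 record sorts before every File2 record, and the result is simply the reversed files one after another.

```python
import heapq

files = [
    ["File1Record%d\n" % i for i in range(1, 5)],
    ["File2Record%d\n" % i for i in range(1, 3)],
    ["File3Record%d\n" % i for i in range(1, 6)],
]

merged = list(heapq.merge(*(reversed(f) for f in files)))

# Each file's records still appear in reversed order...
for f in files:
    assert [line for line in merged if line in f] == f[::-1]

# ...but with these particular names nothing is actually interleaved.
print(merged == [line for f in files for line in reversed(f)])
```

Requirements 1 and 2 hold either way, which is all the question asks for.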