
Slow Python file I/O; Ruby runs better than this; got the wrong language?

Please advise - I'm going to use this as a learning point. I'm a beginner.

I'm splitting a 25 MB file into several smaller files.

A kindly guru here gave me a Ruby script. It works beautifully fast. So, in order to learn, I mimicked it with a Python script. This runs like a three-legged cat (slow). I wonder if anyone can tell me why?

My Python script

    ##split a file into smaller files
###########################################
def splitlines (file) :
        fileNo=0001
        outFile=open("C:\\Users\\dunner7\\Desktop\\Textomics\\Media\\LexisNexus\\ele\\newdocs\%s.txt" % fileNo, 'a') ## open file to append 
        fh = open(file, "r") ## open the file for reading
        mylines = fh.readlines() ### read in lines
        for line in mylines: ## for each line
                        if re.search("Copyright ", line): # if the line is equal to the regex
                            outFile.close()  ##  close the file
                            fileNo +=1  #and add one to the filename, starting to read lines in again
                        else: # otherwise
                            outFile=open("C:\\Users\\dunner7\\Desktop\\Textomics\\Media\\LexisNexus\\ele\\newdocs\%s.txt" % fileNo, 'a') ## open file to append 
                            outFile.write(line)          ## then append it to the open outFile          
        fh.close()

The guru's Ruby 1.9 script

g=0001
f=File.open(g.to_s + ".txt","w")
open("corpus1.txt").each do |line|
  if line[/\d+ of \d+ DOCUMENTS/]
    f.close
    f=File.open(g.to_s + ".txt","w")
    g+=1
  end
  f.print line
end


There are several reasons why your script is slow. The main one is that you reopen the output file for almost every line you write. Since the old file object is implicitly closed when a new one is bound to the same name (CPython's reference counting reclaims it immediately), the write buffer is flushed for every single line you write, which is quite expensive.

A cleaned-up and corrected version of your script would be:

def file_generator():
    file_no = 1
    while True:
        f = open(r"C:\Users\dunner7\Desktop\Textomics\Media"
                 r"\LexisNexus\ele\newdocs\%s.txt" % file_no, 'a')
        yield f
        f.close()
        file_no += 1

def splitlines(filename):
    files = file_generator()
    out_file = next(files)
    with open(filename) as in_file:
        for line in in_file:
            if "Copyright " in line:
                out_file = next(files)
            out_file.write(line)
        out_file.close()
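
To try this approach without touching the real corpus, here is a minimal self-contained sketch. It assumes the separator is the literal string "Copyright " and, unlike the code above, takes the destination directory as a parameter. The names split_on_marker, dest_dir and demo are made up for illustration, and the sample text is invented.

```python
import os
import tempfile


def split_on_marker(filename, dest_dir, marker="Copyright "):
    """Split filename into numbered files inside dest_dir.

    A new output file is started each time a line contains marker; the
    marker line itself is written to the new file, as in the code above.
    Returns the number of output files written.
    """
    file_no = 1
    out_file = open(os.path.join(dest_dir, "%d.txt" % file_no), "w")
    try:
        with open(filename) as in_file:
            for line in in_file:
                if marker in line:
                    # Only here do we pay for a close and an open.
                    out_file.close()
                    file_no += 1
                    out_file = open(
                        os.path.join(dest_dir, "%d.txt" % file_no), "w")
                out_file.write(line)
    finally:
        out_file.close()
    return file_no


def demo():
    """Run the splitter on a tiny invented sample in a temporary directory."""
    tmp = tempfile.mkdtemp()
    src = os.path.join(tmp, "corpus.txt")
    with open(src, "w") as f:
        f.write("doc one\nCopyright 2010\ndoc two\nCopyright 2011\n")
    out_dir = os.path.join(tmp, "out")
    os.mkdir(out_dir)
    count = split_on_marker(src, out_dir)
    contents = {}
    for name in sorted(os.listdir(out_dir)):
        with open(os.path.join(out_dir, name)) as f:
            contents[name] = f.read()
    return count, contents
```

Calling demo() returns the number of files written and a dict mapping each file name to its text, so you can check where each marker line ended up before running on the real data.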


I guess the reason your script is so slow is that you open a new file descriptor for each line. If you look at your guru's Ruby script, it closes and reopens the output file only when the separator matches.

In contrast, your Python script opens a new file descriptor for every line you read (and, by the way, never closes them explicitly). Opening a file requires talking to the kernel, so this is relatively slow.
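
If you want to see the cost for yourself, a rough timing sketch along these lines may help. It is not a careful benchmark: the function names and the generated lines are invented for illustration, and the absolute numbers will depend on your OS and disk.

```python
import os
import tempfile
import time


def write_reopening(path, lines):
    """Open and close the output file once per line (the slow pattern)."""
    for line in lines:
        with open(path, "a") as f:
            f.write(line)


def write_keeping_open(path, lines):
    """Open the output file once and write every line through it."""
    with open(path, "a") as f:
        for line in lines:
            f.write(line)


def compare(n_lines=2000):
    """Time both patterns on generated data.

    Returns (seconds_reopening, seconds_keeping_open, same_output).
    """
    lines = ["line %d\n" % i for i in range(n_lines)]
    tmp = tempfile.mkdtemp()
    slow_path = os.path.join(tmp, "slow.txt")
    fast_path = os.path.join(tmp, "fast.txt")

    start = time.time()
    write_reopening(slow_path, lines)
    slow = time.time() - start

    start = time.time()
    write_keeping_open(fast_path, lines)
    fast = time.time() - start

    # Both patterns must produce byte-for-byte the same file.
    with open(slow_path) as a:
        with open(fast_path) as b:
            same_output = a.read() == b.read()
    return slow, fast, same_output
```

Both functions produce the same file; only the number of open/close calls differs. I would expect the first timing to come out clearly larger, but treat the ratio as a rough indication only.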

Another change I would suggest is to change

fh = open(file, "r") ## open the file for reading
mylines = fh.readlines() ### read in lines
for line in mylines: ## for each line

to

fh = open(file, "r")
for line in fh:

With this change, you do not read the whole file into memory, but only block after block. Although it should not matter with a 25 MiB file, reading everything at once will hurt you with big files, and iterating over the file object is good practice (and less code ;)).
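
A small sketch of the difference, under the assumption that you only need to look at one line at a time. The function names and the sample text are made up for illustration.

```python
import os
import tempfile


def count_matching(path, marker="Copyright "):
    """Count lines containing marker while holding one line at a time."""
    count = 0
    with open(path) as fh:
        for line in fh:  # the file object hands out lines lazily
            if marker in line:
                count += 1
    return count


def count_matching_readlines(path, marker="Copyright "):
    """Same result, but readlines() builds the full list of lines first."""
    with open(path) as fh:
        lines = fh.readlines()
    return sum(1 for line in lines if marker in line)


def make_sample():
    """Write a small invented sample file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as f:
        f.write("a\nCopyright 2010\nb\nCopyright 2011\nc\n")
    return path
```

Both functions return the same count; the second one simply keeps every line of the file in memory until it is done.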


The Python code might be slow because of the regex rather than the I/O. Try:

import re

def splitlines(file):
    fileNo = 1  # plain 1; 0001 is an octal literal in Python 2
    outFile = open("newdocs/%s.txt" % fileNo, 'a')  ## open file to append
    reg = re.compile("Copyright ")  ## compile the pattern once
    for line in open(file, "r"):
        if reg.search(line):  # if the line contains the pattern
            outFile.close()  ## close the current file
            fileNo += 1  # move on to the next file number
            outFile = open("newdocs/%s.txt" % fileNo, 'a')  ## open the next file to append
        outFile.write(line)  ## then append the line to the open outFile
    outFile.close()  ## close the last file

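Several notes:

  • Use / instead of \ in path names; it works on Windows too and avoids accidental escape sequences such as \n
  • If a regex is used repeatedly, compile it once
  • Do you need re.search, or would re.match do? (re.match only matches at the start of the line)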
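
To make the last two notes concrete, here is a minimal sketch. The helper names are invented for illustration; the only library calls used are re.compile and the search and match methods of the compiled pattern.

```python
import re

MARKER = re.compile("Copyright ")  # compiled once, reused for every line


def has_marker_search(line):
    """True if the pattern occurs anywhere in the line."""
    return MARKER.search(line) is not None


def has_marker_match(line):
    """True only if the pattern occurs at the very start of the line."""
    return MARKER.match(line) is not None


def has_marker_in(line):
    """Plain substring test; no regex is needed for a fixed string."""
    return "Copyright " in line
```

For a line such as "  Copyright 2010 Foo", the search and substring versions report a hit while the match version does not, because of the leading spaces. Since the separator here is a fixed string, the substring test is probably all you need.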

UPDATE:

  • @Ed. S: point taken
  • @Winston Ewert: code updated to be closer to the original Ruby code


rosser,

Don't use the names of built-in objects as identifiers in your code (file is a built-in in Python 2, and splitlines is the name of a string method).

The following code preserves the behavior of your own code: the line containing 'Copyright ' is the signal to close the current out_file, and it is not written to any file.

writelines() is used in the hope of faster execution than repeated calls to out_file.write(line).

The if li: block writes the last output file in case the last line of the input file doesn't contain 'Copyright '.

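def splitfile(filename, wordstop, destrep, file_no=1, li=None):
    # Use None as the default: a mutable default list would be shared between calls
    li = [] if li is None else list(li)
    with open(filename) as in_file:
        for line in in_file:
            if wordstop in line:
                # The separator line closes the current chunk and is not written
                with open(destrep + str(file_no) + '.txt', 'w') as f:
                    f.writelines(li)
                file_no += 1
                li = []
            else:
                li.append(line)
        if li:
            # Write whatever is left after the last separator
            with open(destrep + str(file_no) + '.txt', 'w') as f:
                f.writelines(li)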
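
As a quick check of the writelines() point, the sketch below writes the same list of lines both ways and compares the results. The function names are invented for illustration. Note that writelines() does not add line endings itself, which is fine here because lines read from a file already end with a newline.

```python
import os
import tempfile


def write_with_loop(path, lines):
    """Write each line with its own write() call."""
    with open(path, "w") as f:
        for line in lines:
            f.write(line)


def write_with_writelines(path, lines):
    """Hand the whole list to writelines(); it adds no newlines itself."""
    with open(path, "w") as f:
        f.writelines(lines)


def same_result(lines):
    """Write lines both ways into a temporary directory and compare."""
    tmp = tempfile.mkdtemp()
    loop_path = os.path.join(tmp, "loop.txt")
    writelines_path = os.path.join(tmp, "writelines.txt")
    write_with_loop(loop_path, lines)
    write_with_writelines(writelines_path, lines)
    with open(loop_path) as a:
        with open(writelines_path) as b:
            return a.read() == b.read()
```

The output is identical either way, so any speed difference is down to the per-call overhead; I have not measured how large it is on your data.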