Slow Python file I/O; Ruby runs better than this; Got the wrong language?

2023-02-18 16:39 Q&A author:

Please advise - I'm going to use this as a learning point. I'm a beginner.

I'm splitting a 25 MB file into several smaller files.

A kindly guru here gave me a Ruby script. It works beautifully fast. So, in order to learn, I mimicked it with a Python script. This runs like a three-legged cat (slow). I wonder if anyone can tell me why?

My Python script

## split a file into smaller files
###########################################
def splitlines (file) :
    fileNo=0001
    outFile=open("C:\\Users\\dunner7\\Desktop\\Textomics\\Media\\LexisNexus\\ele\\newdocs\%s.txt" % fileNo, 'a') ## open file to append
    fh = open(file, "r") ## open the file for reading
    mylines = fh.readlines() ### read in lines
    for line in mylines: ## for each line
        if re.search("Copyright ", line): # if the line is equal to the regex
            outFile.close()  ##  close the file
            fileNo +=1  # and add one to the filename, starting to read lines in again
        else: # otherwise
            outFile=open("C:\\Users\\dunner7\\Desktop\\Textomics\\Media\\LexisNexus\\ele\\newdocs\%s.txt" % fileNo, 'a') ## open file to append
            outFile.write(line)  ## then append it to the open outFile
    fh.close()

The guru's Ruby 1.9 script

g=0001
f=File.open(g.to_s + ".txt","w")
open("corpus1.txt").each do |line|
  if line[/\d+ of \d+ DOCUMENTS/]
    f.close
    f=File.open(g.to_s + ".txt","w")
    g+=1
  end
  f.print line
end

There are several reasons why your script is slow. The main one is that you reopen the output file for almost every line you write. Since the old file object is implicitly closed as soon as a new one is assigned to the same name (CPython's reference counting reclaims it immediately), the write buffer is flushed for every single line you write, which is quite expensive.

A cleaned up and corrected version of your script would be:

def file_generator():
    file_no = 1
    while True:
        f = open(r"C:\Users\dunner7\Desktop\Textomics\Media"
                 r"\LexisNexus\ele\newdocs\%s.txt" % file_no, 'a')
        yield f
        f.close()
        file_no += 1

def splitlines(filename):
    files = file_generator()
    out_file = next(files)
    with open(filename) as in_file:
        for line in in_file:
            if "Copyright " in line:
                out_file = next(files)
            out_file.write(line)
        out_file.close()
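
The generator can look puzzling at first, so here is a small sketch of what happens on each next() call. The helper name numbered_files and the temporary directory are assumptions made for illustration; they are not part of the script above.

```python
import os
import tempfile

def numbered_files(directory):
    # Hypothetical helper: yields 1.txt, 2.txt, ... inside `directory`.
    file_no = 1
    while True:
        f = open(os.path.join(directory, "%s.txt" % file_no), "a")
        yield f
        f.close()  # runs only when the caller asks for the next file
        file_no += 1

with tempfile.TemporaryDirectory() as tmp:
    files = numbered_files(tmp)
    first = next(files)
    first.write("document one\n")
    second = next(files)  # resumes the generator: closes 1.txt, opens 2.txt
    first_closed = first.closed
    second.write("document two\n")
    second.close()  # the last file has to be closed by the caller
    files.close()   # stop the suspended generator
    created = sorted(os.listdir(tmp))

print(first_closed, created)
```

Each output file is therefore opened once and closed once, no matter how many lines are written to it.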

I guess the reason your script is so slow is that you open a new file descriptor for each line. If you look at your guru's Ruby script, it closes and opens the output file only when your separator matches.

In contrast, your Python script opens a new file descriptor for every line you read (and btw, does not close them). Opening a file requires talking to the kernel, so this is relatively slow.
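
To see the cost for yourself, a rough timing sketch along these lines may help. The function names, file names and line count are made up for the example, and the absolute timings will depend on your OS and disk.

```python
import os
import tempfile
import time

def write_reopening(path, lines):
    # Mimics the slow pattern: open in append mode for every line.
    for line in lines:
        f = open(path, "a")
        f.write(line)
        f.close()

def write_once(path, lines):
    # Opens the file a single time and lets Python buffer the writes.
    with open(path, "a") as f:
        for line in lines:
            f.write(line)

def timed(func, path, lines):
    start = time.perf_counter()
    func(path, lines)
    return time.perf_counter() - start

lines = ["line %d\n" % i for i in range(2000)]
with tempfile.TemporaryDirectory() as tmp:
    slow_path = os.path.join(tmp, "reopened.txt")
    fast_path = os.path.join(tmp, "once.txt")
    slow = timed(write_reopening, slow_path, lines)
    fast = timed(write_once, fast_path, lines)
    with open(slow_path) as a, open(fast_path) as b:
        same_output = a.read() == b.read()

print("reopen per line: %.4f s" % slow)
print("open once:       %.4f s" % fast)
print("identical output:", same_output)
```

Both functions produce the same file; only the number of open/close round trips differs.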

Another change I would suggest is to replace

fh = open(file, "r") ## open the file for reading
mylines = fh.readlines() ### read in lines
for line in mylines: ## for each line

with

fh = open(file, "r")
for line in fh:

With this change, you do not read the whole file into memory, but only block after block. Although it should not matter with a 25 MiB file, reading everything at once will hurt you with big files, and iterating over the file directly is good practice (and less code ;)).
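
A minimal sketch of the difference, using an in-memory io.StringIO as a stand-in for a real file (the function names and the sample text are only for illustration):

```python
import io

def count_matches_eager(fh, marker):
    # readlines() builds a list holding every line before the loop starts.
    lines = fh.readlines()
    return sum(1 for line in lines if marker in line)

def count_matches_lazy(fh, marker):
    # Iterating over the file object yields one line at a time.
    return sum(1 for line in fh if marker in line)

sample = "first\nCopyright 2011\nsecond\nCopyright 2012\n"
eager = count_matches_eager(io.StringIO(sample), "Copyright ")
lazy = count_matches_lazy(io.StringIO(sample), "Copyright ")
print(eager, lazy)  # both count 2 matching lines
```

The results are the same; the lazy version just never holds more than the current line (plus the read buffer) in memory.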

~~The Python code might be slow due to regex and not IO.~~ Try

import re

def splitlines (file) :
  fileNo = 1  # a literal like 0001 is a syntax error in Python 3
  outFile=open("newdocs/%s.txt" % fileNo, 'a') ## open file to append
  reg = re.compile("Copyright ")
  for line in open(file, "r"):
    if reg.search(line): # if the line contains the pattern
      outFile.close()  ##  close the file
      fileNo +=1  # add one to the file number before opening the next file
      outFile=open("newdocs/%s.txt" % fileNo, 'a') ## open file to append

    outFile.write(line)          ## then append it to the open outFile
  outFile.close()  ## close the last output file

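Several notes:

Prefer / to \ in path names; Python accepts forward slashes on Windows, and they cannot be mistaken for escape sequences
If a regex is used repeatedly, compile it
Do you need re.search or re.match? re.match only matches at the start of the string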
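
The notes above can be checked with a short sketch. The sample line and the path C:\newdocs are invented for the example.

```python
import re

pattern = re.compile("Copyright ")
line = "All rights reserved. Copyright 2011 Example Corp\n"

found_by_search = pattern.search(line) is not None  # scans the whole line
found_by_match = pattern.match(line) is not None    # only tests the start
found_by_in = "Copyright " in line                  # plain substring test, no regex

# In a normal string literal, \n is an escape sequence (a newline),
# so a backslash path can silently change; a raw string keeps it intact.
escaped_path = "C:\newdocs"
raw_path = r"C:\newdocs"

print(found_by_search, found_by_match, found_by_in)
print(len(escaped_path), len(raw_path))
```

For a fixed string like "Copyright ", the plain `in` test is enough and avoids the regex engine altogether.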

UPDATE:

@Ed. S: point taken
@Winston Ewert: code updated to be closer to the original Ruby code

rosser,

Don't use names of built-in objects as identifiers in your code (file is a built-in type in Python 2, and splitlines is the name of a string method).

The following code preserves the behavior of your own code: each out_file is closed without the line containing 'Copyright ', which serves as the signal to close it.

writelines() is used in the hope of faster execution than repeated calls to out_file.write(line).

The if li: block is there to write out the remaining lines in case the last line of the input file doesn't contain 'Copyright '.

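def splitfile(filename, wordstop, destrep, file_no = 1, li = None):
    if li is None:  # a mutable default ([]) would be shared between calls
        li = []
    with open(filename) as in_file:
        for line in in_file:
            if wordstop in line:
                # the separator line itself is not written
                with open(destrep+str(file_no)+'.txt','w') as f:
                    f.writelines(li)
                file_no += 1
                li = []
            else:
                li.append(line)
        if li:  # flush whatever follows the last separator
            with open(destrep+str(file_no)+'.txt','w') as f:
                f.writelines(li)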
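
For reference, writelines() writes the items exactly as given (it adds no newlines), so it produces the same output as a loop of write() calls. A small sketch using io.StringIO in place of a real file:

```python
import io

lines = ["alpha\n", "beta\n", "gamma\n"]

one_by_one = io.StringIO()
for line in lines:
    one_by_one.write(line)  # one call per line

batched = io.StringIO()
batched.writelines(lines)   # one call for the whole list

print(one_by_one.getvalue() == batched.getvalue())
```

Whether the batched call is measurably faster depends on the Python version and the file's buffering, so it is worth timing on your own data.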
