Running "wc -l <filename>" within Python Code
I want to do 10-fold cross-validation on huge files (running into hundreds of thousands of lines each). Each time I start reading a file I want to run a "wc -l" on it, then generate random numbers a fixed number of times, each time writing the line with that number into a separate file. I am using this:
import os
for i in files:
    os.system("wc -l <insert filename>")
How do I insert the file name there? It's a variable. I went through the documentation, but the examples mostly use ls commands, which don't have this problem.
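To sketch the whole workflow described above (count lines, pick random line numbers, copy those lines to a separate file), something like the following works; the function names line_count and sample_lines are illustrative, not from any library:

```python
import random
import subprocess

def line_count(filename):
    # Pass the filename as a separate list element, so no shell
    # quoting or string interpolation is needed.
    out = subprocess.check_output(["wc", "-l", filename])
    return int(out.split()[0])

def sample_lines(filename, k, out_filename):
    # Choose k distinct random line numbers, then copy those lines
    # to a separate file in a single pass over the source.
    n = line_count(filename)
    chosen = set(random.sample(range(n), k))
    with open(filename) as src, open(out_filename, "w") as dst:
        for i, line in enumerate(src):
            if i in chosen:
                dst.write(line)
```

This avoids os.system entirely, which sidesteps the variable-filename problem in the question.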
Let's compare:
from subprocess import check_output

def wc(filename):
    return int(check_output(["wc", "-l", filename]).split()[0])

def native(filename):
    c = 0
    with open(filename) as file:
        while True:
            chunk = file.read(10 ** 7)
            if chunk == "":
                return c
            c += chunk.count("\n")

def iterate(filename):
    with open(filename) as file:
        for i, line in enumerate(file):
            pass
        return i + 1
Go go timeit function!
from timeit import timeit
from sys import argv

filename = argv[1]

def testwc():
    wc(filename)

def testnative():
    native(filename)

def testiterate():
    iterate(filename)

print "wc", timeit(testwc, number=10)
print "native", timeit(testnative, number=10)
print "iterate", timeit(testiterate, number=10)
Result:
wc 1.25185894966
native 2.47028398514
iterate 2.40715694427
So wc is about twice as fast on a 150 MB compressed file with ~500 000 line breaks, which is what I tested on. However, testing on a file generated with seq 3000000 > bigfile, I get these numbers:
wc 0.425990104675
native 0.400163888931
iterate 3.10369205475
Hey look, python FTW! However, using longer lines (~70 chars):
wc 1.60881590843
native 3.24313092232
iterate 4.92839002609
So the conclusion: it depends, but wc seems to be the best bet all round.
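For readers on Python 3: the snippets above are Python 2. A Python 3 equivalent of the chunked native() counter reads in binary mode and counts bytes (buf_count_newlines is an illustrative name):

```python
def buf_count_newlines(filename):
    # Read fixed-size binary chunks and count b"\n" directly; this
    # avoids creating one string object per line, which is what makes
    # plain line iteration slow on files with short lines.
    count = 0
    with open(filename, "rb") as f:
        while True:
            chunk = f.read(1 << 20)  # 1 MiB at a time
            if not chunk:
                return count
            count += chunk.count(b"\n")
```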
import subprocess
for f in files:
    subprocess.call(['wc', '-l', f])
Also have a look at http://docs.python.org/library/subprocess.html#convenience-functions for example, if you want to access the output as a string, you'll want to use subprocess.check_output() instead of subprocess.call().
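As a minimal sketch of that convenience function with a variable filename (wc_lines is an illustrative name; text=True requires Python 3.7+):

```python
import subprocess

def wc_lines(filename):
    # check_output captures the command's stdout; text=True decodes
    # it to str.  The output looks like "42 somefile\n", so we take
    # the first whitespace-separated field.
    out = subprocess.check_output(["wc", "-l", filename], text=True)
    return int(out.split()[0])
```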
No need to use wc -l. Use the following Python function:
def file_len(fname):
    i = 0  # guards against a NameError on an empty file
    with open(fname) as f:
        for i, l in enumerate(f, 1):
            pass
    return i
This is probably more efficient than calling an external utility (which loops over the input in a similar fashion).
Update
Dead wrong, wc -l is a lot faster!

$ seq 10000000 > huge_file
$ time wc -l huge_file
10000000 huge_file

real    0m0.267s
user    0m0.110s
sys     0m0.010s

$ time ./p.py
10000000

real    0m1.583s
user    0m1.040s
sys     0m0.060s
os.system gets a string. Just build the string explicitly:
import os
for i in files:
    os.system("wc -l " + i)
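One caveat with plain concatenation: a filename containing spaces or shell metacharacters will break the command. Quoting the name fixes that; in Python 3 shlex.quote does the escaping (pipes.quote in Python 2), and count_with_system below is an illustrative name:

```python
import os
import shlex

def count_with_system(filename):
    # shlex.quote escapes spaces and shell metacharacters, so the
    # shell sees the whole name as one argument.  os.system prints
    # wc's output and returns the exit status (0 on success).
    return os.system("wc -l " + shlex.quote(filename))
```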
Here is a Python approach I found to solve this problem:
count_of_lines_in_any_textFile = sum(1 for l in open('any_textFile.txt'))
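One caveat with the one-liner: it leaves the file handle open until garbage collection. A with-statement variant (count_lines is an illustrative name) closes it deterministically:

```python
def count_lines(path):
    # sum(1 for ...) counts lines lazily without holding them in
    # memory; the with block closes the file as soon as it finishes.
    with open(path) as f:
        return sum(1 for _ in f)
```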
I found a much simpler way:
import os
linux_shell='more /etc/hosts|wc -l'
linux_shell_result=os.popen(linux_shell).read()
print(linux_shell_result)
My solution is very similar to the “native” function by lazyr:
import functools

def file_len2(fname):
    with open(fname, 'rb') as f:
        lines = 0
        last_wasnt_nl = False  # stays False for an empty file
        reader = functools.partial(f.read, 131072)
        for datum in iter(reader, ''):
            lines += datum.count('\n')
            last_wasnt_nl = datum[-1] != '\n'
        return lines + last_wasnt_nl
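Note that this snippet is Python 2: in Python 3, open(fname, 'rb') yields bytes, so the iter() sentinel must be b'' and the counting must use b'\n'. A Python 3 port of the same idea might look like this (file_len2_py3 is an illustrative name):

```python
import functools

def file_len2_py3(fname):
    # Count newlines chunk by chunk; if the file's last byte is not
    # a newline, the final partial line counts as one more line.
    lines = 0
    last = b"\n"  # an empty file yields no extra line
    with open(fname, "rb") as f:
        for chunk in iter(functools.partial(f.read, 131072), b""):
            lines += chunk.count(b"\n")
            last = chunk[-1:]
    return lines + (last != b"\n")
```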
This, unlike wc, considers a final line not ending with '\n' as a separate line. If one wants the same functionality as wc, it can be (quite unpythonically :) written as:
import functools as ft, itertools as it, operator as op

def file_len3(fname):
    with open(fname, 'rb') as f:
        reader = ft.partial(f.read, 131072)
        counter = op.methodcaller('count', '\n')
        return sum(it.imap(counter, iter(reader, '')))
with comparable times to wc on all the test files I produced.
NB: this applies to Windows and POSIX machines. Old MacOS used '\r' as line-end characters.