Comparing file contents in Python

I have two files, say source and target. For each line in source, I check whether it also exists in target. If it does not exist in target, I print it (the end goal is to have zero differences). Here is the code I have written:

def finddefaulters(source,target):
  f = open(source,'r')
  g = open(target,'r')

  reference = f.readlines()
  done = g.readlines()
  for i in reference:
    if i not in done:
      print i,

I need help with:

  1. How would this code be rated on a scale of 1-10?
  2. How can I make it better, and more efficient when the files are huge?

Another question: when I read all the lines as list elements, they come back as 'element\n'. So for a correct comparison, I have to add a newline at the end of each file. Is there a way to strip the newlines so that I do not have to add a newline at the end of the files? I tried rstrip, but it did not work. Thanks in advance.


Regarding efficiency: The method you show has an asymptotic runtime complexity of O(m*n), where m and n are the number of elements in reference and done. That is, if you double the size of both lists, the algorithm will take about 4 times as long (up to a constant factor that is uninteresting to theoretical computer scientists). If m and n are very large, you will probably want to choose a faster algorithm, e.g. sort the two lists first using .sort() (runtime complexity: O(n * log(n))) and then go through both lists just once (runtime complexity: O(n)). Overall, that algorithm has a worst-case runtime complexity of O(n * log(n)), which is already a big improvement. However, you trade readability and simplicity of the code for efficiency, so I would only advise you to do this if absolutely necessary.
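A minimal sketch of the sort-then-scan idea, in Python 3 syntax (the question's code is Python 2). The function name missing_after_sort is hypothetical, and the two-pointer walk is just one assumed way to do the single pass. Note that the result comes out in sorted order, not in the order of the source file.

```python
def missing_after_sort(reference, done):
    """Return the items of `reference` that are absent from `done`.

    Both inputs are sorted first (O(n log n)), then walked once (O(n)).
    """
    ref_sorted = sorted(reference)
    done_sorted = sorted(done)
    missing = []
    j = 0  # position in done_sorted; it only ever moves forward
    for item in ref_sorted:
        # Skip every entry of done_sorted that is smaller than item.
        while j < len(done_sorted) and done_sorted[j] < item:
            j += 1
        # item is missing if done_sorted is exhausted or holds a different value here.
        if j == len(done_sorted) or done_sorted[j] != item:
            missing.append(item)
    return missing


result = missing_after_sort(["b", "a", "d", "c"], ["c", "a"])
print(result)
```

Because j never moves backwards, the scan touches each element of both lists at most once, so the sorting step dominates the cost.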

Regarding coding style: You never .close() the file handles, which you should. Instead of opening and closing the file handles by hand, you could use Python's with statement. Also, if you like the functional style, you could replace the for loop with a list comprehension:

for i in reference:
    if i not in done:
        print i,

then becomes:

items = [i.strip() for i in reference if i not in done]
print ' '.join(items)

However, this way you will not see any progress while the list is being composed.

As joaquin already mentions, you can loop over f directly instead of f.readlines() as file handles support the iterator protocol.
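Putting the style points together, here is a hedged sketch in Python 3 syntax: with closes the files, the file handles are iterated directly, and rstrip("\n") removes the trailing newline so a missing final newline no longer matters. Keep in mind that rstrip returns a new string rather than changing the original, which may be why the attempt in the question seemed to have no effect. The name find_defaulters and the use of a set for lookups are assumptions of this sketch, not something the question or answers prescribe; the set assumes the distinct lines of target fit in memory.

```python
def find_defaulters(source, target):
    """Return the lines of `source` that do not occur in `target`."""
    with open(target) as target_file:
        # Strip the newline from every line; a set gives fast membership tests.
        done = {line.rstrip("\n") for line in target_file}

    missing = []
    with open(source) as source_file:
        for line in source_file:  # iterate lazily instead of calling readlines()
            entry = line.rstrip("\n")  # rstrip returns a new string
            if entry not in done:
                missing.append(entry)
    return missing
```

Calling find_defaulters("source.txt", "target.txt") (hypothetical file names) returns the missing lines in the order they appear in the source file; an empty list means there are no differences.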


Some ideas:

1) Use the with statement to open files safely:

with open(source) as f:
     .............

The with statement is used to wrap the execution of a block with methods defined by a context manager. This allows common try...except...finally usage patterns to be encapsulated for convenient reuse.

2) you can iterate over the lines of a file instead of using readlines:

for line in f:
     ..........

3) Although for this short snippet it could be enough, try to use more informative names for your variables. One-letter names are not recommended.

4) If you want to take advantage of the Python standard library, try the functions in the difflib module. For example, the make_file method of difflib.HtmlDiff:

make_file(fromlines, tolines[, fromdesc][, todesc][, context][, numlines]) 

Compares fromlines and tolines (lists of strings) and returns a string which is a complete HTML file containing a table showing line by line differences with inter-line and intra-line changes highlighted.
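As a small illustration, this is how make_file might be called, in Python 3 syntax. It is a method of difflib.HtmlDiff, so an instance is created first. The helper name html_diff, the sample lines, and the "source"/"target" labels are made up for this sketch.

```python
import difflib


def html_diff(fromlines, tolines):
    """Return a complete HTML page showing a side-by-side diff of two line lists."""
    differ = difflib.HtmlDiff()
    # fromdesc/todesc are the column headings shown above each side of the table.
    return differ.make_file(fromlines, tolines, fromdesc="source", todesc="target")


# Illustrative input: in practice these would be the lines read from the two files.
report = html_diff(["alpha\n", "beta\n"], ["alpha\n", "gamma\n"])
```

The returned string can be written to a file and opened in a browser to inspect the differences.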
