Comparing two text files, removing the duplicate lines, and writing the results to a new text file

I have two text files (that are not equal in number of lines/size). I would like to compare each line of the shorter text file with every line of the longer text file. As it compares, if there are any duplicate strings, I would like to have those removed. Lastly, I would like to write the result to a new text file and print the contents.

Is there a simple script that can do this for me?

Any help would be much appreciated.

The text files are not very large. One has about 10 lines and the other has about 5. The code I have tried (that failed miserably) is below:

for line in file2:
    line1 = line
    for line in file1:
        requested3 = file('request2.txt','a')
        if fnmatch.fnmatch(line1,line):
            line2 = line.replace(line,"")
            requested3.write(line2)
        if not fnmatch.fnmatch(line1,line):
            requested3.write(line+'\n')
        requested3.close()


with open(longfilename) as longfile, open(shortfilename) as shortfile, open(newfilename, 'w') as newfile:
    # build the set once; a file object can only be read through a single time
    longlines = set(longfile)
    newfile.writelines(line for line in shortfile if line not in longlines)

It's as simple as that. This copies the lines from shortfile to newfile unless they also exist in longfile. Only the lines of longfile are held in memory (as a set); shortfile is read one line at a time.

If you're on Python 2.6 or older, you would need to nest the with statements:

with open(longfilename) as longfile:
    with open(shortfilename) as shortfile:
        with open(newfilename, 'w') as newfile:
            longlines = set(longfile)
            newfile.writelines(line for line in shortfile if line not in longlines)

If you're on Python 2.5, you need to either:

from __future__ import with_statement 

at the very top of your file, or just use

longfile = open(longfilename) 

etc. and close each file yourself.

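If you need to manipulate the lines, an explicit for loop is fine; the important part is set(). Looking up an item in a set is fast, whereas looking up a line in a long list is slow. Whatever you do to the lines of one file before comparing, do the same to the lines of the other.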

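# strip() stands in for whatever normalization you need
longlines = set(line.strip() for line in longfile)
for line in shortfile:
    # apply the same normalization before the lookup
    if line.strip() not in longlines:
        newfile.write(line)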
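
Putting the pieces together, here is a minimal end-to-end sketch that writes the result and hands it back so it can be printed, as the question asks. The function name write_unique_lines and its parameters are made up for illustration, and it assumes that surrounding whitespace should be ignored when comparing:

```python
def write_unique_lines(short_path, long_path, out_path):
    """Write the lines of short_path that are absent from long_path to out_path."""
    with open(long_path) as longfile:
        # Compare without surrounding whitespace, so a missing final
        # newline in either file does not cause a false mismatch.
        longlines = set(line.strip() for line in longfile)

    with open(short_path) as shortfile:
        kept = [line.strip() for line in shortfile
                if line.strip() not in longlines]

    with open(out_path, 'w') as newfile:
        newfile.writelines(text + '\n' for text in kept)

    # Return the kept lines so the caller can print them.
    return kept
```

Calling print('\n'.join(write_unique_lines('short.txt', 'long.txt', 'result.txt'))) would then show what was written; the three file names are placeholders.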


Assuming both files are plain text, with each string on its own line terminated by a \n newline character:

small_file = open('file1.txt','r')
long_file = open('file2.txt','r')
output_file = open('output_file.txt','w')

try:
    small_lines = small_file.readlines()
    long_lines = long_file.readlines()
    # normalize both sides: drop trailing whitespace and ignore case
    small_lines_cleaned = [line.rstrip().lower() for line in small_lines]
    long_lines_cleaned = [line.rstrip().lower() for line in long_lines]

    for line in small_lines_cleaned:
        if line not in long_lines_cleaned:
            output_file.write(line + '\n')

finally:
    small_file.close()
    long_file.close()
    output_file.close()

Explanation:

  1. Since you can't get 'with' statements working, we open the files first using regular open functions, then use a try...finally clause to close them at the end of the program.
  2. We take the small file and the long file and first remove any trailing '\n' (newline) characters with .rstrip(), then make all the characters lower-case with .lower(). If you have two sentences identical in every aspect except one has upper case letters and the other doesn't, they won't match. Forcing them lower case avoids that; if you prefer a case-sensitive compare, remove the .lower() method.
  3. We go line by line in small_lines_cleaned (for line in...) and see if it is in the larger file.
  4. Output each line if it is not in the longer file; we add the '\n' newline character so that each line will appear on a new line, insteadOfOneGiantLongSetOfStrings
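
The steps above write the lower-cased text to the output file. If you would rather compare case-insensitively but keep the original capitalization in the output, something along these lines should work. This is only a sketch; the function name missing_from_long is made up, and it takes lists of lines (e.g. from readlines()) rather than file names:

```python
def missing_from_long(small_lines, long_lines):
    """Return lines of small_lines with no case-insensitive match in long_lines."""
    # Comparison keys for the longer file: no trailing whitespace, lower case.
    long_keys = set(line.rstrip().lower() for line in long_lines)

    result = []
    for line in small_lines:
        key = line.rstrip().lower()
        if key not in long_keys:
            # Keep the original capitalization; only the line ending is normalized.
            result.append(line.rstrip() + '\n')
    return result
```

The result can be passed straight to output_file.writelines().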


I'd use difflib; it makes it easy to do comparisons/diffs. There is a nice tutorial for it here. If you just want the lines that are unique to the shorter file:

from difflib import ndiff

short = open('short.txt').readlines()
long = open('long.txt').readlines()

with open('unique.txt', 'w') as f:
    f.write(''.join(x[2:] for x in ndiff(short, long) if x.startswith('-')))
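
One thing worth knowing before relying on this: ndiff compares the two files as ordered sequences, so a line that appears in both files but at a different position can still be reported as removed. A small sketch with made-up data (the variable names are only for illustration) contrasts it with a plain set lookup:

```python
from difflib import ndiff

short = ["a\n", "b\n"]
long_lines = ["b\n", "a\n"]  # same lines as short, different order

# Lines that ndiff marks as present only in the first sequence.
diff_unique = [x[2:] for x in ndiff(short, long_lines) if x.startswith('- ')]

# Lines of short that do not occur anywhere in long_lines.
long_set = set(long_lines)
set_unique = [line for line in short if line not in long_set]
```

Here set_unique is empty, while diff_unique contains one of the two lines. If the order of lines does not matter for your files, the set lookup is the safer test.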


Your code as it stands checks each line against the line in the other file. But that's not what you want. For each line in the first file, you need to check whether any line in the other file matches and then print it out if there are no matches.
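
In other words, the decision to keep a line has to wait until it has been compared with every line of the other file. A rough sketch of that logic, using plain equality instead of fnmatch and a made-up function name:

```python
def lines_without_match(first_lines, other_lines):
    """Return the lines of first_lines that match no line of other_lines."""
    # Strip line endings once so the comparison is on the text alone.
    other = [line.rstrip('\n') for line in other_lines]

    result = []
    for line in first_lines:
        text = line.rstrip('\n')
        # any() looks at every line of the other file before we decide.
        if not any(text == candidate for candidate in other):
            result.append(text)
    return result
```

For files of 5 and 10 lines the nested comparison is perfectly fine; for larger files a set is faster.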


The following code reads file two and checks it against file one. Anything that's in file one but not in file two gets printed and also written to a new text file.

If you want the opposite, just remove the "not" from the if statement below, so that it prints anything that's in both file one and file two.

It works by putting the contents of the shorter file (file two) in a variable and then reading the longer file (file one) line by line. Each line is checked against the variable and is then either written or not written to the text file according to its presence in the variable.

(The not in the if statement is what selects the non-matching lines; remove it if you want the matching lines printed instead.)

fileOne = open("LONG FILE.ext","r")
fileTwo = open("SHORT FILE.ext","r")
fileThree = open("Results.txt","a+")

# one entry per line of the shorter file, without the line ending
contents = set(line.rstrip("\n") for line in fileTwo)

# a file can only be read through once, so loop over it directly
for line in fileOne:
    line = line.rstrip("\n")
    if line not in contents:
        print (line)
        fileThree.write (line + "\n")

fileOne.close()
fileTwo.close()
fileThree.close()
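
One detail to check in any variant of this approach: if the shorter file is held as a single string (for example the result of read()), the in operator tests for substrings, so a short line can match inside a longer one. A tiny illustration with made-up data:

```python
# Made-up contents of a short file, held in two different ways.
as_string = "tomcat\nbird\n"
as_lines = set(as_string.splitlines())

line = "cat"

# Substring test: "cat" occurs inside "tomcat", so this is True.
substring_hit = line in as_string

# Whole-line test: no line is exactly "cat", so this is False.
line_hit = line in as_lines
```

Keeping the lines in a set (or list) makes the test match whole lines only.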