Compare two files line by line and generate the difference in another file
I want to compare file1 with file2 and generate a file3 which contains the lines in file1 which are not present in file2.
diff(1) is not the answer, but comm(1) is.
NAME
comm - compare two sorted files line by line
SYNOPSIS
comm [OPTION]... FILE1 FILE2
...
-1 suppress lines unique to FILE1
-2 suppress lines unique to FILE2
-3 suppress lines that appear in both files
So
comm -2 -3 file1 file2 > file3
The input files must be sorted. If they are not, sort them first. This can be done with a temporary file, or...
comm -2 -3 <(sort file1) <(sort file2) > file3
provided that your shell supports process substitution (bash does).
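For shells without process substitution, the temporary-file route mentioned above might look like the sketch below. The scratch directory, file contents and the ".sorted" names are made up for illustration; LC_ALL=C is an assumption added so that sort and comm agree on the ordering.

```shell
# Hypothetical sample data in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' pear apple cherry banana > "$tmp/file1"
printf '%s\n' cherry apple > "$tmp/file2"

# Sort each input into its own temporary file; LC_ALL=C keeps sort and
# comm in agreement about the collation order.
LC_ALL=C sort "$tmp/file1" > "$tmp/file1.sorted"
LC_ALL=C sort "$tmp/file2" > "$tmp/file2.sorted"

# -2 -3 leaves only column 1: the lines unique to file1.
LC_ALL=C comm -2 -3 "$tmp/file1.sorted" "$tmp/file2.sorted" > "$tmp/file3"

cat "$tmp/file3"
```

Note that file3 comes out in sorted order, not in the original order of file1.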
The Unix utility diff is meant for exactly this purpose.
$ diff -u file1 file2 > file3
See the manual and the Internet for options, different output formats, etc.
Consider this:
file a.txt:
abcd
efgh
file b.txt:
abcd
You can find the difference with:
diff -a --suppress-common-lines -y a.txt b.txt
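The output will be the unmatched line followed by padding and a < marker, which flags a line found only in a.txt (padding shortened here):
efgh    <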
You can redirect the output to an output file (c.txt) using:
diff -a --suppress-common-lines -y a.txt b.txt > c.txt
This will answer your question:
"...which contains the lines in file1 which are not present in file2."
No grep solution yet?
Lines that exist only in file2:
grep -Fxvf file1 file2 > file3
Lines that exist only in file1:
grep -Fxvf file2 file1 > file3
Lines that exist in both files:
grep -Fxf file1 file2 > file3
Description of the switches (see also man grep):
- -F tells grep to interpret PATTERNS as fixed strings, not regular expressions.
- -x tells grep to select only matches that exactly match the whole line, not partial matches.
- -f makes grep obtain the patterns from FILE, one per line.
- -v inverts the sense of matching, to select non-matching lines.
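To see why -F and -x matter, here is a small sketch with made-up data: one line contains a regex metacharacter and one pattern is a substring of another line. The scratch directory and the output file names are illustrative only.

```shell
# Hypothetical sample data in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 'a.c' 'abc' 'foobar' 'baz' > "$tmp/file1"
printf '%s\n' 'a.c' 'foo' > "$tmp/file2"

# Lines of file1 that are not in file2, compared as whole fixed strings.
grep -Fxvf "$tmp/file2" "$tmp/file1" > "$tmp/only_in_file1"

# Without -F and -x the patterns are regular expressions matched anywhere
# in the line: "a.c" also matches "abc", and "foo" also matches "foobar",
# so both lines are wrongly dropped.
grep -vf "$tmp/file2" "$tmp/file1" > "$tmp/loose"

echo "strict:"; cat "$tmp/only_in_file1"
echo "loose:";  cat "$tmp/loose"
```

Unlike the comm approach, this keeps the lines in their original file1 order.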
Sometimes diff is the utility you need, but sometimes join is more appropriate. The files need to be pre-sorted or, if you are using a shell which supports process substitution such as bash, ksh or zsh, you can do the sort on the fly.
join -v 1 <(sort file1) <(sort file2)
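A minimal sketch of the same idea, written with temporary files so that it also runs in shells without process substitution. The sample data and file names are made up, and LC_ALL=C is an assumption added so that sort and join use the same ordering. Keep in mind that join compares only the first whitespace-separated field by default, so this behaves as a whole-line comparison only when each line is a single word, as it is here.

```shell
# Hypothetical sample data: one word per line.
tmp=$(mktemp -d)
printf '%s\n' delta alpha charlie bravo > "$tmp/file1"
printf '%s\n' charlie alpha > "$tmp/file2"

# join needs both inputs sorted on the join field.
LC_ALL=C sort "$tmp/file1" > "$tmp/file1.sorted"
LC_ALL=C sort "$tmp/file2" > "$tmp/file2.sorted"

# -v 1 prints only the lines of the first file that have no match in the second.
LC_ALL=C join -v 1 "$tmp/file1.sorted" "$tmp/file2.sorted" > "$tmp/file3"

cat "$tmp/file3"
```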
Try
sdiff file1 file2
It usually works much better for me. You may want to sort the files first if the order of lines is not important (e.g. some text config files).
For example,
sdiff -w 185 file1.cfg file2.cfg
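You could use diff with the following output formatting (these options are specific to GNU diff):
diff --old-line-format='' --unchanged-line-format='' file1 file2
--old-line-format='' suppresses lines that appear only in file1.
--unchanged-line-format='' suppresses lines that are the same in both files.
What remains is the lines that appear only in file2; swap the two file arguments to get the lines that appear only in file1.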
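As a sketch of what these formats leave behind, assuming GNU diff and made-up sample files, the following runs the command both ways round. The --new-line-format variant is the mirror-image option and is not part of the command above.

```shell
# Hypothetical sample data in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' one two three > "$tmp/file1"
printf '%s\n' one three four > "$tmp/file2"

# Old (file1-only) and unchanged lines are formatted as nothing, so only
# the new (file2-only) lines are printed, without any markers.
# diff exits with status 1 when the files differ, hence the "|| true".
diff --old-line-format='' --unchanged-line-format='' \
    "$tmp/file1" "$tmp/file2" > "$tmp/only_in_file2" || true

# Emptying the new-line format instead keeps the file1-only lines.
diff --new-line-format='' --unchanged-line-format='' \
    "$tmp/file1" "$tmp/file2" > "$tmp/only_in_file1" || true

echo "only in file2:"; cat "$tmp/only_in_file2"
echo "only in file1:"; cat "$tmp/only_in_file1"
```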
I'm surprised nobody mentioned diff -y to produce a side-by-side output, for example:
diff -y file1 file2 > file3
And in file3 (differing lines have a | symbol in the middle):
same same
diff_1 | diff_2
If you need to solve this with coreutils the accepted answer is good:
comm -23 <(sort file1) <(sort file2) > file3
You can also use sd (stream diff), which doesn't require sorting nor process substitution and supports infinite streams, like so:
cat file1 | sd 'cat file2' > file3
Probably not that much of a benefit in this example, but still consider it; in some cases you won't be able to use comm, grep -F, or diff.
Here's a blogpost I wrote about diffing streams on the terminal, which introduces sd.
Many answers already, but none of them perfect IMHO. Thanatos' answer leaves some extra characters per line and Sorpigal's answer requires the files to be sorted or pre-sorted, which may not be adequate in all circumstances.
I think the best way of getting the lines that are different and nothing else (no extra chars, no re-ordering) is a combination of diff, grep, and awk (or similar).
If the lines do not contain any "<", a short one-liner can be:
diff urls.txt* | grep "<" | sed 's/< //g'
but that will remove every instance of "< " (less than, space) from the lines, which is not always OK (e.g. source code). The safest option is to use awk:
diff urls.txt* | grep '^<' | awk '{for (i=2; i<NF; i++) printf "%s ", $i; print $NF}'
This one-liner diffs both files, keeps only the lines that diff prefixes with "<" (those from the first file), then drops that leading "<" marker. It works even if the lines themselves contain "<", although awk re-joins the fields with single spaces, so runs of whitespace inside a line are collapsed.
diff a1.txt a2.txt | grep '^> ' | sed 's/^> //' > a3.txt
I tried almost all the answers in this thread, but none was complete. After a few trials, the command above worked for me. diff gives you the differences, but with some unwanted marker characters; the lines that exist only in the second file (a2.txt) start with '> '. So the next step is to grep the lines that start with '> ', and then remove that prefix with sed.
Use the diff utility and extract only the lines starting with < from its output.
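One way this could look, as a sketch with made-up sample files: sed both selects the "< " lines and strips the marker, anchored to the start of the line so that a "<" inside the text is left alone.

```shell
# Hypothetical sample data; one line deliberately contains "<".
tmp=$(mktemp -d)
printf '%s\n' 'keep' 'a < b' 'gone' > "$tmp/file1"
printf '%s\n' 'keep' > "$tmp/file2"

# In diff's default output, lines found only in the first file start with "< ".
# sed -n prints just those lines, with the two-character marker removed.
diff "$tmp/file1" "$tmp/file2" | sed -n 's/^< //p' > "$tmp/file3"

cat "$tmp/file3"
```

Because diff compares the files in order, the result keeps the original line order of file1, but a line that merely moved to a different position may still be reported.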
If you have a CSV file with single or even multiple columns, you can do these line-by-line "diff" operations using the sqlite3 embedded database. It comes with Python, so it should be available on most Linux and Mac systems. You can script the sqlite3 commands in the bash shell without needing to write Python.
- Create your a.csv and b.csv files
- Ensure sqlite3 is installed using the command "sqlite3 -help"
- Run the below commands directly on the Linux/Mac shell (or put it in a script)
echo "
.mode csv
.import a.csv atable
.import b.csv btable
create table result as select * from atable EXCEPT select * from btable;
.output result.csv
select * from result ;
.quit
" | sqlite3 temp.db
Note: ensure that each sqlite3 command is on its own line.
How it works
- Import the 2 csvs into "atable" and "btable" respectively.
- Use the "except" sql operator to select the data available in "atable" but missing in "btable". Create a "result" table using the select query statement
- Output the result table to result.csv by running "select * from result;"
If you need to operate on specific columns, sqlite3 or any db is the way to go.
I have tried diffing multi-GB files using the built-in diff and comm tools. In my experience, SQLite beats the Linux utilities by a mile.
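linecount=0
# Read file1 ($1) on stdin and file2 ($2) on descriptor 3, so that both
# files advance one line per iteration instead of file2 being reopened.
while IFS= read -r line1; do
    linecount=$((linecount + 1))
    IFS= read -r line2 <&3
    if [ "$line1" != "$line2" ]; then
        echo "============diff: $linecount"
        echo "LINE1 $line1"
        echo "LINE2 $line2"
        echo ""
    fi
done < "$1" 3< "$2"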