PDF compare on linux command line [closed]
I'm looking for a Linux command line tool to compare two PDF files and save the diffs to a PDF outfile. The tool should create diff PDFs in a batch process. The PDF files are construction plans, so a pure text comparison doesn't work.
Something like:
<tool> file1.pdf file2.pdf -o diff-out.pdf
Most of the tools I found convert the PDFs to images and compare them, but only with a GUI.
Any other solution is also welcome.
I've written my own script that does something similar to what you're asking for. The script uses 4 tools to achieve its goal:
- ImageMagick's compare command
- the pdftk utility (if you have multipage PDFs)
- Ghostscript (optional)
- md5sum (optional)
It should be quite easy to port this to a .bat batch file for DOS/Windows.
But first, please note: this only works well for PDFs which have the same page/media size. The comparison is done pixel by pixel between the two input PDFs. The resulting file is an image showing the "diff" like this:
- Each pixel that remains unchanged becomes white.
- Each pixel that got changed is painted in red.
That diff image is saved as a new PDF to make it more accessible on different OS platforms.
I'm using this, for example, to discover minimal page display differences when font substitution in PDF processing comes into play.
It can happen that there is no visible difference between your PDFs even though they differ in MD5 hash and/or file size. In this case the "diff" output PDF page would be all white. You can detect this condition automatically and delete the all-white PDFs, so that you only have to visually inspect the non-white ones.
Here are the building blocks:
pdftk
Use this command line utility to split multipage PDF files into multiple single-page PDFs:
pdftk file_1.pdf burst output somewhere/file_1---page_%03d.pdf
pdftk file_2.pdf burst output somewhere/file_2---page_%03d.pdf
If you are comparing 1-page PDFs only, this building block is optional. Since you talk about "construction plans", this is likely the case.
compare
Use this command line utility from ImageMagick to create a "diff" PDF page for each of the pages:
compare \
-verbose \
-debug coder \
-log "%u %m:%l %e" \
somewhere/file_1---page_001.pdf \
somewhere/file_2---page_001.pdf \
-compose src \
somewhereelse/file_1--file_2---diff_page_001.pdf
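For multipage PDFs, the command above has to be repeated for every burst page. A minimal sketch of such a loop, assuming the file naming from the pdftk step; the function name diff_all_pages and the COMPARE override variable are hypothetical and only introduced here for illustration:

```shell
#!/bin/bash
# Hypothetical helper: run ImageMagick's compare on every pair of burst pages.
# Assumes the naming from the pdftk step:
#   file_1---page_NNN.pdf and file_2---page_NNN.pdf in the same directory.
# COMPARE can be overridden (e.g. COMPARE=echo) for a dry run.
COMPARE="${COMPARE:-compare}"

diff_all_pages() {
    local in_dir="$1" out_dir="$2" p1 p2 page
    mkdir -p "$out_dir"
    for p1 in "$in_dir"/file_1---page_*.pdf; do
        [ -e "$p1" ] || continue              # the glob matched nothing
        page="${p1##*---page_}"               # e.g. "001.pdf"
        page="${page%.pdf}"                   # e.g. "001"
        p2="$in_dir/file_2---page_${page}.pdf"
        if [ ! -e "$p2" ]; then
            echo "page_${page}: no counterpart in file_2" >&2
            continue
        fi
        # compare exits non-zero when the pages differ; ignore the status here
        "$COMPARE" "$p1" "$p2" -compose src \
            "$out_dir/file_1--file_2---diff_page_${page}.pdf" || true
    done
}
```

With the directories used in this answer, the call would be diff_all_pages somewhere somewhereelse. Setting COMPARE=echo first prints the commands instead of running them.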
Ghostscript
Because of automatically inserted metadata (such as the current date and time), PDF output does not work well for MD5-hash-based file comparisons.
If you want to automatically discover all cases where the diff PDF consists of a purely white page, you should convert the PDF page to a metadata-free bitmap format using the ppmraw output device. You can do that like this:
First, find out what the page size of your PDF is. Again, this little utility, identify, comes as part of any ImageMagick installation:
identify \
-format "%[fx:(w)]x%[fx:(h)]" \
somewhereelse/file_1--file_2---diff_page_001.pdf
You can store this value in an environment variable like this:
export my_size=$(identify \
-format "%[fx:(w)]x%[fx:(h)]" \
somewhereelse/file_1--file_2---diff_page_001.pdf)
Now Ghostscript comes into play, using a commandline which includes the above discovered page size as it is stored in the variable:
gs \
-o somewhereelse/file_1--file_2---diff_page_001.ppm \
-sDEVICE=ppmraw \
-r72 \
-g${my_size} \
somewhereelse/file_1--file_2---diff_page_001.pdf
This gives you a PPM (Portable PixMap) with a resolution of 72 dpi from the original PDF page. 72 dpi usually is good enough for what we want... Next, create a purely white PPM page with the same page size:
gs \
-o somewhereelse/file_1--file_2---whitepage_001.ppm \
-sDEVICE=ppmraw \
-r72 \
-g${my_size} \
-c "showpage"
The -c "showpage" part is a PostScript command that tells Ghostscript to emit an empty page only.
MD5 sum
Use the MD5 hash to automatically compare the original PPM with the whitepage PPM. If they are the same, you can safely assume that there are no differences between the PDFs and therefore rename or delete the diff PDF:
MD5_1=$(md5sum somewhereelse/file_1--file_2---diff_page_001.ppm | awk '{print $1}')
MD5_2=$(md5sum somewhereelse/file_1--file_2---whitepage_001.ppm | awk '{print $1}')
if [ "${MD5_1}" = "${MD5_2}" ]; then
    mv \
        somewhereelse/file_1--file_2---diff_page_001.pdf \
        somewhereelse/file_1--file_2---NODIFFERENCE_page_001.pdf # rename all-white PDF
    # the pattern *page_001.ppm matches both diff_page_001.ppm and whitepage_001.ppm
    rm \
        somewhereelse/file_1--file_2---*page_001.ppm # delete both PPMs
fi
This spares you from having to visually inspect "diff PDFs" that do not have any differences.
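For a batch run, the same check can be wrapped in a function and called once per page. This is only a sketch under the naming used above; flag_blank_diff is a hypothetical name, and the function assumes the diff PDF's file name contains "---diff_page_":

```shell
#!/bin/bash
# Hypothetical helper: rename a diff PDF when the PPM rendered from it is
# byte-identical to the white-page PPM of the same size.
# Arguments: diff PDF, PPM rendered from the diff PDF, white-page PPM.
# Returns 0 if the page was blank (and renamed), 1 otherwise.
flag_blank_diff() {
    local diff_pdf="$1" diff_ppm="$2" white_ppm="$3" sum_diff sum_white
    sum_diff=$(md5sum < "$diff_ppm" | awk '{print $1}')
    sum_white=$(md5sum < "$white_ppm" | awk '{print $1}')
    if [ "$sum_diff" = "$sum_white" ]; then
        # rename all-white PDF, then delete both PPMs
        mv -- "$diff_pdf" "${diff_pdf/---diff_page_/---NODIFFERENCE_page_}"
        rm -f -- "$diff_ppm" "$white_ppm"
        return 0
    fi
    return 1
}
```

The return status makes it easy to count or list the pages that still need a visual check.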
Here is a hack to do it.
pdftotext file1.pdf
pdftotext file2.pdf
diff file1.txt file2.txt
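The same hack can be done without leaving .txt files behind. A small sketch, assuming poppler's pdftotext (which writes to stdout when the output file is "-"); the function name pdf_text_diff and the PDFTOTEXT override variable are hypothetical. Like the hack above, this only compares text, so it will not catch changes in drawings:

```shell
#!/bin/bash
# Hypothetical helper: diff the extracted text of two PDFs via process substitution.
# PDFTOTEXT can be overridden; by default it is poppler's pdftotext.
PDFTOTEXT="${PDFTOTEXT:-pdftotext}"

pdf_text_diff() {
    # "-" as output file sends the text to stdout; -layout keeps the physical layout.
    # The exit status is that of diff: 0 if the text is identical, 1 if it differs.
    diff -u <("$PDFTOTEXT" -layout "$1" -) <("$PDFTOTEXT" -layout "$2" -)
}
```

Usage would be pdf_text_diff file1.pdf file2.pdf.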
Done in 2 lines with (the almighty) ImageMagick and pdftk:
compare -verbose -debug coder "$PDF_1" "$PDF_2" -compose src "$OUT_FILE.tmp"
pdftk "$OUT_FILE.tmp" background "$PDF_1" output "$OUT_FILE"
The options -verbose and -debug are optional.
- compare creates a PDF with the diff as red pixels.
- pdftk merges the diff-pdf with background PDF_1
In 2022, the answers based on applying compare directly to PDF files are not working for me. It seems that this command no longer handles PDFs properly.
However, compare does work when applied to PNG files.
I have adopted bits and pieces from the previous answers to write a different script. In fact, two different scripts, doing slightly different things: ComparePdfs.sh and ComparePdfs2.sh, to be executed on the command line. Both scripts are listed at the end of this answer.
Some caveats
These two scripts are comparing the two PDF files page by page, and each pair of pages is compared purely visually (since the pages are converted to PNG). So the scripts are sensitive only to flat text and flat graphics. If the only difference between the two PDF files concerns some other kind of PDF content—such as logical structuring elements, annotations, form-fields, layers, videos, 3D objects (U3D or PRC), etc.—both scripts will nevertheless report that the two PDFs are the same.
I haven't tried to compare PDFs specifically with respect to this 'extra' kind of content.
How to tell if two files (PDF or not) have completely identical content
The only other kind of comparison I know how to do is the one that lets us know if the contents of the two PDF files are completely identical in every respect, including the various embedded metadata, such as the creation date, the document’s title (which has nothing to do with any title displayed on the first page), the program used to create the PDF, and so on.
It's the same method that can be used to check if any two files (PDF or not) are bit-by-bit identical.
To do this, all you have to do is compute and compare checksums for the two files. I'm including a script for that as well, called AreIdentical.sh. It is listed at the very end of this answer. Here is how to use it.
Suppose the two files are named "my_first_PDF_file.pdf" and "another_PDF_file.pdf". Then, once you execute the following on the command line, the output text will read "same" or "different" depending on whether the two files are the same or different.
AreIdentical.sh my_first_PDF_file.pdf another_PDF_file.pdf
Note that information such as the file's name is not considered when the checksums are computed. The reason is that the name of the file is stored not within the file itself but in the directory entry of the file. So two files may be found to be identical even if their file names are different; see this question. Similarly, the timestamp shown by ls -l (the file system's modification time, as opposed to the creation date in the PDF's embedded metadata) is also not considered when checksums are computed, for the same reason.
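To see this in action, here is a small self-contained demonstration. Note that it swaps in cmp -s (a byte-by-byte comparison) instead of checksums, and that are_identical is a hypothetical function name, not the script listed at the end; the dummy files are not real PDFs:

```shell
#!/bin/bash
# Demonstration: neither the file name nor the timestamp is part of the file's content.
# cmp -s compares byte by byte and reports the result only via its exit status.
are_identical() {
    if cmp -s -- "$1" "$2"; then echo "same"; else echo "different"; fi
}

demo_dir=$(mktemp -d)
printf 'dummy content\n' > "$demo_dir/original.pdf"
cp "$demo_dir/original.pdf" "$demo_dir/renamed_copy.pdf"
# change only the modification time of the copy (GNU touch syntax)
touch -d '2001-01-01 00:00:00' "$demo_dir/renamed_copy.pdf"
printf 'dummy content!\n' > "$demo_dir/edited.pdf"

are_identical "$demo_dir/original.pdf" "$demo_dir/renamed_copy.pdf"   # prints "same"
are_identical "$demo_dir/original.pdf" "$demo_dir/edited.pdf"         # prints "different"
rm -rf "$demo_dir"
```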
How to use the scripts ComparePdfs.sh and ComparePdfs2.sh
We assume that the two pdf files to be (purely visually) compared, file1.pdf and file2.pdf, are in the working directory.
As an example, assume that they both have 4 pages, and that all pages are identical except page 3.
To do exactly what the OP asked for, on the command line, we execute
ComparePdfs2.sh file1.pdf file2.pdf dif_in_files.pdf
where I picked a particular name, dif_in_files.pdf, for the outfile. The execution takes a bit of time because, for both input PDF files, each individual page must be converted to PNG. The current page being processed is printed in the terminal. At the end, the script will produce the file dif_in_files.pdf in the working directory; it contains the difference pages for all the pages. Any differences are highlighted in red.
If we are only interested in seeing the pages that are different, or only in whether the files differ at all, then we use ComparePdfs.sh.
On the command line, we execute
ComparePdfs.sh file1.pdf file2.pdf
In the terminal, the script will output the following:
page_001: same
page_002: same
page_003: different
page_004: same
For the pages that turned out different, and only those pages, the script will create files that highlight the differences. In the above example, the script would generate just one file, called difference_page_003.png.
How ComparePdfs.sh works
For each of the two PDF files, we use pdftk to burst it into individual pages, and then convert each page to PNG. Now we consider the PNGs of the first pages of the two files. We create a checksum for each (I chose to use b2sum to do that).
If the checksums are the same, we take the first pages of the two files to be the same.
If the checksums are different, we take the first pages of the two files to be different, and use compare to generate a difference PNG file for them.
We repeat this for each page. At the end, we erase all the .pdf and .png files of the individual pages, except for the difference files.
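As a possible variation on the checksum step, the decision "same or different" can also be left to compare itself. The following is only a sketch of that alternative, not what the scripts below do; it assumes compare's documented exit codes (0 if the images are similar, 1 if they are dissimilar, 2 on error), and page_status and the COMPARE override variable are hypothetical names:

```shell
#!/bin/bash
# Hypothetical alternative to the checksum test: ask ImageMagick's compare directly.
# Assumed exit codes of compare: 0 = similar, 1 = dissimilar, 2 = error.
COMPARE="${COMPARE:-compare}"

page_status() {
    local rc=0
    # -metric AE counts the differing pixels; null: discards the difference image
    "$COMPARE" -metric AE "$1" "$2" null: >/dev/null 2>&1 || rc=$?
    case "$rc" in
        0) echo "same" ;;
        1) echo "different" ;;
        *) echo "error" ;;
    esac
}
```

For two page PNGs this would be called as page_status page_a.png page_b.png (placeholder file names).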
The scripts
Here is ComparePdfs2.sh.
#!/bin/bash
file_1="$1"
file_2="$2"
outfile="$3"
# DPI resolution for the pdftoppm command, which converts PDF to PNG
resolution=150
# burst the files into individual pages
pdftk "$file_1" burst output "${file_1%.*}---page_%03d.pdf"
pdftk "$file_2" burst output "${file_2%.*}---page_%03d.pdf"
# array in which we collect the names of the .png files to be combined into a single .pdf file
DiffFiles=()
# loop over the individual pages of the first file
for f1 in "${file_1%.*}"---page_*.pdf
do
    # f2 is the name of the PDF of the corresponding page of the second file
    f2="${file_2%.*}${f1#"${file_1%.*}"}"
    # 'b' is an auxiliary variable used to create the variable 'page'
    b="${f1#"${file_1%.*}---"}"
    # 'page' holds the current page label, e.g. 'page_003'
    page="${b%.pdf}"
    # print the current page being processed
    echo -n "$page "
    # convert the individual page PDFs to PNGs
    pdftoppm -png -r "$resolution" "$f1" "${f1%.*}"
    pdftoppm -png -r "$resolution" "$f2" "${f2%.*}"
    # 'g1' and 'g2' are the names of the two PNG files we just created
    g1="${f1%.*}-1.png"
    g2="${f2%.*}-1.png"
    # create the difference file for this page
    compare "$g1" "$g2" "${outfile%.*}_${page}.png"
    # add the name of the new difference .png file to the DiffFiles array
    DiffFiles+=("${outfile%.*}_${page}.png")
done
echo
# convert the .png difference files to a single .pdf file
convert "${DiffFiles[@]}" "$outfile"
# clean up
rm -f "${file_1%.*}"---page_* "${file_2%.*}"---page_* "${outfile%.*}"_page_* doc_data.txt
Here is ComparePdfs.sh
#!/bin/bash
file_1="$1"
file_2="$2"
# DPI resolution for the pdftoppm command, which converts PDF to PNG
resolution=150
# burst the files into individual pages
pdftk "$file_1" burst output "${file_1%.*}---page_%03d.pdf"
pdftk "$file_2" burst output "${file_2%.*}---page_%03d.pdf"
# loop over the individual pages of the first file
for f1 in "${file_1%.*}"---page_*.pdf
do
    # f2 is the name of the PDF of the corresponding page of the second file
    f2="${file_2%.*}${f1#"${file_1%.*}"}"
    # 'b' is an auxiliary variable used to create the variable 'page'
    b="${f1#"${file_1%.*}---"}"
    # 'page' holds the current page label, e.g. 'page_003'
    page="${b%.pdf}"
    # convert the individual page PDFs to PNGs
    pdftoppm -png -r "$resolution" "$f1" "${f1%.*}"
    pdftoppm -png -r "$resolution" "$f2" "${f2%.*}"
    # 'g1' and 'g2' are the names of the two PNG files we just created
    g1="${f1%.*}-1.png"
    g2="${f2%.*}-1.png"
    # create the checksums for the two PNG files
    B2S_1=$(b2sum "$g1" | awk '{print $1}')
    B2S_2=$(b2sum "$g2" | awk '{print $1}')
    # now we compare the checksums
    if [ "$B2S_1" = "$B2S_2" ]; then
        echo "$page: same"
    else
        echo "$page: different"
        # if the checksums are different, create a difference PNG image
        compare "$g1" "$g2" "difference_${page}.png"
    fi
done
# clean up
rm -f "${file_1%.*}"---page_* "${file_2%.*}"---page_* doc_data.txt
Finally, here is AreIdentical.sh:
#!/bin/bash
file_1="$1"
file_2="$2"
# compute a checksum for each file (quoted so that names with spaces work)
B2S_1=$(b2sum "$file_1" | awk '{print $1}')
B2S_2=$(b2sum "$file_2" | awk '{print $1}')
if [ "$B2S_1" = "$B2S_2" ]; then echo "same"; else echo "different"; fi
Here is a finished script, "cmppdf", based on linguisticturn's code plus support for comparing the text in the PDFs and some polishing:
https://abhweb.org/jima/cmppdf
Documentation:
NAME
    cmppdf -- Compare the visual appearance or text of PDF files
SYNOPSIS
    cmppdf [-o BASEPATH] [-q] [-d] FILE1 FILE2
    cmppdf --text [-o BASEPATH] [-q] [-d] FILE1 FILE2
EXIT STATUS
    0   if no differences found
    1   if differences found
    2+  if trouble
OPTIONS
    -t, --text  Compare the text in the PDFs, ignoring graphical appearance.
    -o, --output BASEPATH
        With this option a "difference file" named BASEPATH_page_NNN.png
        or .txt is created for each page which has differences.
        With visual comparison (the default), the files will be .png images with
        changed parts highlighted in RED. With text comparison (--text option),
        the files will contain output from the 'diff' command, or if BASEPATH
        is '-' then all diffs are written to stdout.
    --diff diff-option1,diff-option2, ...
        Specify options to pass to the 'diff' command, separated by commas.
        The default is '-u'. --text is implied by --diff.
    -q, --quiet  Suppress all progress messages
    -d, --debug  Show detailed information about commands run
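Since the question asks for batch processing, here is a rough sketch of how the documented exit status could drive a loop over many file pairs. It relies only on the options and exit codes listed above; batch_cmppdf, the CMPPDF override variable, and the assumption that both directories contain PDFs with matching file names are hypothetical:

```shell
#!/bin/bash
# Hypothetical batch wrapper around cmppdf, using only the documented
# options (-q, -o BASEPATH) and exit codes (0 same, 1 different, 2+ trouble).
# Assumes old_dir and new_dir contain PDFs with matching file names.
CMPPDF="${CMPPDF:-cmppdf}"

batch_cmppdf() {
    local old_dir="$1" new_dir="$2" out_dir="$3" old new name rc
    mkdir -p "$out_dir"
    for old in "$old_dir"/*.pdf; do
        [ -e "$old" ] || continue                 # the glob matched nothing
        name=$(basename "$old" .pdf)
        new="$new_dir/$name.pdf"
        if [ ! -e "$new" ]; then
            echo "$name: missing in $new_dir"
            continue
        fi
        rc=0
        "$CMPPDF" -q -o "$out_dir/$name" "$old" "$new" || rc=$?
        case "$rc" in
            0) echo "$name: same" ;;
            1) echo "$name: different" ;;
            *) echo "$name: trouble (exit status $rc)" ;;
        esac
    done
}
```

A call such as batch_cmppdf plans_old plans_new diffs (placeholder directory names) would then print one status line per file and leave the difference images under diffs/.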
@linguisticturn: Please contact me at the email given in the script so I can give you proper credit!