PDF compare on linux command line [closed]

I'm looking for a Linux command line tool to compare two PDF files and save the differences to an output PDF. The tool should create the diff PDFs in a batch process. The PDF files are construction plans, so a pure text comparison doesn't work.

Something like:

<tool> file1.pdf file2.pdf -o diff-out.pdf

Most of the tools I found convert the PDFs to images and compare them, but only with a GUI.

Any other solution is also welcome.


I've written my own script that does something similar to what you're asking for. The script uses 4 tools to achieve its goal:

  1. ImageMagick's compare command
  2. the pdftk utility (if you have multipage PDFs)
  3. Ghostscript (optional)
  4. md5sum (optional)

It should be quite easy to port this to a .bat batch file for DOS/Windows.

But first, please note: this only works well for PDFs which have the same page/media size. The comparison is done pixel by pixel between the two input PDFs. The resulting file is an image showing the "diff" like this:

  • Each pixel that remains unchanged becomes white.
  • Each pixel that got changed is painted in red.

That diff image is saved as a new PDF to make it more accessible across OS platforms.

I'm using this for example to discover minimal page display differences when font substitution in PDF processing comes into play.

It can happen that there is no visible difference between your PDFs even though their MD5 hashes and/or file sizes differ. In this case the "diff" output PDF page is all white. You can detect this condition automatically and delete the all-white diff PDFs, so that you only have to visually inspect the non-white ones.

Here are the building blocks:

pdftk

Use this command line utility to split multipage PDF files into multiple singlepage PDFs:

pdftk  file_1.pdf  burst  output  somewhere/file_1---page_%03d.pdf
pdftk  file_2.pdf  burst  output  somewhere/file_2---page_%03d.pdf

If you are comparing 1-page PDFs only, this building block is optional. Since you talk about "construction plans", this is likely the case.

compare

Use this command line utility from ImageMagick to create a "diff" PDF page for each of the pages:

compare \
       -verbose \
       -debug coder \
       -log "%u %m:%l %e" \
        somewhere/file_1---page_001.pdf \
        somewhere/file_2---page_001.pdf \
       -compose src \
        somewhereelse/file_1--file_2---diff_page_001.pdf
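
The command above handles page 001 only. A minimal sketch of how it could be repeated for every burst page follows. The function name diff_all_pages and its two directory arguments are hypothetical; the file-name pattern is the one used in the pdftk step, and pages that exist only in file_1 are skipped with a warning.

```shell
# Sketch: diff every burst page of file_1 against the matching page of file_2.
# diff_all_pages is a hypothetical helper; $1 = directory holding the burst
# pages, $2 = directory that receives the diff PDFs.
diff_all_pages() {
    local indir="$1" outdir="$2" p1 p2 page
    mkdir -p "$outdir"
    for p1 in "$indir"/file_1---page_*.pdf; do
        [ -e "$p1" ] || continue            # the glob matched nothing
        page="${p1##*---page_}"             # e.g. "001.pdf"
        page="${page%.pdf}"                 # e.g. "001"
        p2="$indir/file_2---page_${page}.pdf"
        if [ ! -e "$p2" ]; then
            echo "page ${page}: missing in file_2" >&2
            continue
        fi
        # compare exits non-zero when the pages differ, so its status is ignored here
        compare "$p1" "$p2" -compose src \
            "$outdir/file_1--file_2---diff_page_${page}.pdf" || true
    done
}
```

Called as diff_all_pages somewhere somewhereelse, this would leave one diff PDF per page pair in somewhereelse.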

Ghostscript

Because of automatically inserted metadata (such as the current date and time), PDF output does not work well for MD5-hash-based file comparisons.

If you want to automatically discover all cases where the diff PDF consists of a purely white page, you should convert the PDF page to a metadata-free bitmap format using Ghostscript's ppmraw output device. You can do that like this:

First, find out the page size of your PDF. The little identify utility comes as part of any ImageMagick installation:

 identify \
   -format "%[fx:(w)]x%[fx:(h)]" \
    somewhereelse/file_1--file_2---diff_page_001.pdf

You can store this value in an environment variable like this:

 export my_size=$(identify \
   -format "%[fx:(w)]x%[fx:(h)]" \
    somewhereelse/file_1--file_2---diff_page_001.pdf)

Now Ghostscript comes into play, using a commandline which includes the above discovered page size as it is stored in the variable:

 gs \
   -o somewhereelse/file_1--file_2---diff_page_001.ppm \
   -sDEVICE=ppmraw \
   -r72 \
   -g${my_size} \
    somewhereelse/file_1--file_2---diff_page_001.pdf

This gives you a PPM (Portable PixMap) with a resolution of 72 dpi from the original PDF page. 72 dpi usually is good enough for what we want... Next, create a purely white PPM page with the same page size:

 gs \
   -o somewhereelse/file_1--file_2---whitepage_001.ppm \
   -sDEVICE=ppmraw \
   -r72 \
   -g${my_size} \
   -c "showpage"

The -c "showpage" part is a PostScript command that tells Ghostscript to emit an empty page only.

MD5 sum

Use the MD5 hash to automatically compare the diff PPM with the white-page PPM. If they are the same, you can safely assume that there are no differences between the PDFs and therefore rename or delete the diff PDF:

 MD5_1=$(md5sum somewhereelse/file_1--file_2---diff_page_001.ppm | awk '{print $1}')
 MD5_2=$(md5sum somewhereelse/file_1--file_2---whitepage_001.ppm | awk '{print $1}')

 if [ "x${MD5_1}" = "x${MD5_2}" ]; then
     mv  \
       somewhereelse/file_1--file_2---diff_page_001.pdf \
       somewhereelse/file_1--file_2---NODIFFERENCE_page_001.pdf # rename all-white PDF
     rm  \
       somewhereelse/file_1--file_2---*_page_001.ppm            # delete both PPMs
 fi

This spares you from having to visually inspect "diff PDFs" that do not have any differences.
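
As a possible alternative to the white-page check (a different technique from the one described above, not part of the original script), ImageMagick's compare can report a difference metric itself. The sketch below assumes that compare -metric AE exits with status 0 when the images match, 1 when they differ, and 2 on error; it is worth confirming this on your ImageMagick version. The function name classify_diff is hypothetical.

```shell
# Sketch: classify a page pair by the exit status of "compare -metric AE".
# Assumption: status 0 = images match, 1 = images differ, anything else = error.
classify_diff() {
    local status
    # "null:" discards the diff image; the pixel count printed on stderr is dropped too
    compare -metric AE "$1" "$2" null: >/dev/null 2>&1
    status=$?
    case "$status" in
        0) echo "same" ;;
        1) echo "different" ;;
        *) echo "error" ;;
    esac
}
```

With this, a diff PDF only needs to be written for the pairs reported as "different".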


Here is a hack to do it.

pdftotext file1.pdf
pdftotext file2.pdf
diff file1.txt file2.txt
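
A slightly more defensive sketch of the same hack follows. It writes the extracted text to a temporary directory instead of next to the PDFs and returns 2 if the extraction fails. The function name text_diff is hypothetical, and -layout is an optional pdftotext flag that tries to keep the physical layout of the text.

```shell
# Sketch: diff the extracted text of two PDFs without leaving .txt files behind.
# Returns 0 if the text matches, 1 if it differs, 2 if extraction failed.
text_diff() {
    local tmpdir status
    tmpdir=$(mktemp -d) || return 2
    if pdftotext -layout "$1" "$tmpdir/a.txt" &&
       pdftotext -layout "$2" "$tmpdir/b.txt"; then
        diff -u "$tmpdir/a.txt" "$tmpdir/b.txt"
        status=$?
    else
        status=2
    fi
    rm -rf "$tmpdir"
    return "$status"
}
```

As the question notes, this only helps where the change shows up in the text; differences in the drawings themselves stay invisible to it.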


Done in 2 lines with (the almighty) ImageMagick and pdftk:

compare -verbose -debug coder "$PDF_1" "$PDF_2" -compose src "$OUT_FILE.tmp"
pdftk "$OUT_FILE.tmp" background "$PDF_1" output "$OUT_FILE"

The options -verbose and -debug are optional.

  • compare creates a PDF with the diff as red pixels.
  • pdftk overlays the diff PDF on PDF_1, which is used as the background.


In 2022, the answers based on applying compare directly to PDF files are not working for me. It seems that this command no longer handles PDFs properly.

However, compare does work when applied to PNG files.

I have adopted bits and pieces from the previous answers to write a different script. In fact, two different scripts, doing slightly different things: ComparePdfs.sh and ComparePdfs2.sh, to be executed on the command line. Both scripts are listed at the end of this answer.

Some caveats

These two scripts compare the two PDF files page by page, and each pair of pages is compared purely visually (since the pages are converted to PNG). So the scripts are sensitive only to flat text and flat graphics. If the only difference between the two PDF files concerns some other kind of PDF content (such as logical structuring elements, annotations, form fields, layers, videos, or 3D objects in U3D or PRC format), both scripts will nevertheless report that the two PDFs are the same.

I haven't tried comparing PDFs with respect to this 'extra' kind of content.

How to tell if two files (PDF or not) have completely identical content

The only other kind of comparison I know how to do is the one that lets us know if the contents of the two PDF files are completely identical in every respect, including the various embedded metadata, such as the creation date, the document’s title (which has nothing to do with any title displayed on the first page), the program used to create the PDF, and so on.

It's the same method that can be used to check if any two files (PDF or not) are bit-by-bit identical.

To do this, all you have to do is compute and compare checksums for the two files. I'm including a script for that as well, called AreIdentical.sh. It is listed at the very end of this answer. Here is how to use it.

Suppose the two files are named "my_first_PDF_file.pdf" and "another_PDF_file.pdf". Then, once you execute the following on the command line, the output text will read "same" or "different" depending on whether the two files are the same or different.

AreIdentical.sh my_first_PDF_file.pdf another_PDF_file.pdf

Note that information such as the file's name is not considered when the checksums are computed. The reason is that the name of the file is stored not within the file itself but in the directory entry of the file. So two files may be found to be identical even if their file names are different; see this question. Similarly, the creation date as returned by ls -l (as opposed to the one that's in the PDF's embedded metadata) is also not considered when checksums are computed, for the same reason.
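
For a plain byte-for-byte check, the standard cmp utility can serve as an alternative to checksums. The sketch below is an illustration only; the function name are_identical is hypothetical, and an unreadable or missing file is reported as "different".

```shell
# Sketch: byte-for-byte comparison with cmp instead of checksums.
# cmp -s prints nothing and exits 0 only when the two files are identical.
are_identical() {
    if cmp -s "$1" "$2"; then
        echo "same"
    else
        echo "different"   # also reached when a file is missing or unreadable
    fi
}
```

As with the checksum approach, only the contents are read, so two files with different names can still be reported as "same".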

How to use the scripts ComparePdfs.sh and ComparePdfs2.sh

We assume that the two pdf files to be (purely visually) compared, file1.pdf and file2.pdf, are in the working directory.

As an example, assume that they both have 4 pages, and that all pages are identical except page 3.

To do exactly what the OP asked for,

on the command line, we execute

ComparePdfs2.sh file1.pdf file2.pdf dif_in_files.pdf

where I picked a particular name, dif_in_files.pdf, for the outfile. The execution takes a bit of time because for both input PDF files, each individual page must be converted to PNG. The current page being processed is printed in the terminal. At the end, in the working directory, the script will produce the file dif_in_files.pdf, which contains the difference pages for all the pages. Any differences are highlighted in red.
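
Since the OP asked for batch processing, here is a minimal sketch of how the call could be repeated for many file pairs. It assumes a hypothetical layout in which the old and new versions of each plan share a file name and live in two separate directories. The function name batch_compare and the diff_ prefix for the output files are also assumptions.

```shell
# Sketch: run a pairwise comparison tool over two directories of PDFs.
# $1 = directory with old versions, $2 = directory with new versions,
# $3 = output directory, $4 = tool to run (defaults to ComparePdfs2.sh).
batch_compare() {
    local old_dir="$1" new_dir="$2" out_dir="$3"
    local tool="${4:-ComparePdfs2.sh}"   # any command taking: FILE1 FILE2 OUTFILE
    local old new name
    mkdir -p "$out_dir"
    for old in "$old_dir"/*.pdf; do
        [ -e "$old" ] || continue        # the glob matched nothing
        name=$(basename "$old")
        new="$new_dir/$name"
        if [ -e "$new" ]; then
            "$tool" "$old" "$new" "$out_dir/diff_$name"
        else
            echo "skipping $name: no counterpart in $new_dir" >&2
        fi
    done
}
```

For example, batch_compare plans_old plans_new plans_diff would write one diff PDF per plan into plans_diff, provided ComparePdfs2.sh is on the PATH.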

If we only want to see the pages that differ, or only want to know whether the files differ at all, we use ComparePdfs.sh.

On the command line, we execute

ComparePdfs.sh file1.pdf file2.pdf

In the terminal, the script will output the following:

page_001: same
page_002: same
page_003: different
page_004: same

For the pages that turned out different, and only those pages, the script will create files that highlight the differences. In the above example, the script would generate just one file, called difference_page_003.png.

How ComparePdfs.sh works

For each of the two pdf files, we use pdftk to burst it into individual pages, and then convert each page to PNG. Now we consider the PNGs of the first pages of the two files. We create a checksum for each (I chose to use b2sum to do that).

If the checksums are the same, we take the first pages of the two files to be the same.

If the checksums are different, we take the first pages of the two files to be different, and use compare to generate a difference PNG file for them.

We repeat this for each page. At the end, we erase all the .pdf and .png files of the individual pages, except for the difference files.

The scripts

Here is ComparePdfs2.sh.

#!/bin/bash
file_1="$1"
file_2="$2"
outfile="$3"

# here we set the DPI resolution for the pdftoppm command, which will convert PDF to PNG
resolution=150

# bursting the files into individual pages
pdftk  "$file_1"  burst  output "${file_1%.*}---page_%03d.pdf"
pdftk  "$file_2"  burst  output "${file_2%.*}---page_%03d.pdf"

# this will be a string variable in which we collect the names of the .png files to be converted to a single .pdf file
DiffFiles=""

# we loop over the individual pages of the first file
for f1 in "${file_1%.*}"---*.pdf
do 

  # f2 is the name of the PDF of the corresponding page of the second file
  f2="${f1/${file_1%.*}/${file_2%.*}}" 
  
  # 'b' is an auxiliary variable used to create the variable 'page'
  b="${f1/${file_1%.*}---/}"
  
  # 'page' holds the current page number, e.g. 'page_003'
  page="${b/.pdf/}"
  
  # print the current page being processed
  echo -n "$page "
  
  # convert the individual page PDFs to PNGs
  pdftoppm "$f1" "${f1%.*}" -png -r $resolution
  pdftoppm "$f2" "${f2%.*}" -png -r $resolution
  
  # 'g1' and 'g2' are the names of the two PNG files we just created
  g1=${f1%.*}-1.png
  g2=${f2%.*}-1.png 
  
  # create the difference file for this page
  compare "$g1" "$g2" "${outfile%.*}_$page.png"

  # add the latest name of the difference .png file to the DiffFiles variable
  DiffFiles="$DiffFiles${outfile%.*}_$page.png "
done
echo

# convert the .png difference files to a single .pdf file
convert $DiffFiles "$outfile"

# clean up
rm -f "${file_1%.*}"---page_* "${file_2%.*}"---page_* "${outfile%.*}"_page_* doc_data.txt

Here is ComparePdfs.sh

#!/bin/bash
file_1="$1"
file_2="$2"

# here we set the DPI resolution for the pdftoppm command, which will convert PDF to PNG
resolution=150

# bursting the files into individual pages
pdftk  "$file_1"  burst  output "${file_1%.*}---page_%03d.pdf"
pdftk  "$file_2"  burst  output "${file_2%.*}---page_%03d.pdf"

# we loop over the individual pages of the first file
for f1 in "${file_1%.*}"---*.pdf
do 
  # f2 is the name of the PDF of the corresponding page of the second file
  f2="${f1/${file_1%.*}/${file_2%.*}}" 
  
  # 'b' is an auxiliary variable used to create the variable 'page'
  b="${f1/${file_1%.*}---/}"
  
  # 'page' holds the current page number, e.g. 'page_003'
  page="${b/.pdf/}"
  
  # convert the individual page PDFs to PNGs
  pdftoppm "$f1" "${f1%.*}" -png -r $resolution
  pdftoppm "$f2" "${f2%.*}" -png -r $resolution
  
  # 'g1' and 'g2' are the names of the two PNG files we just created
  g1=${f1%.*}-1.png
  g2=${f2%.*}-1.png 
  
  # create the checksums for the two PNG files
  B2S_1=$(b2sum "$g1" | awk '{print $1}') 
  B2S_2=$(b2sum "$g2" | awk '{print $1}') 
  
  # now we compare the checksums
  if [ "$B2S_1" = "$B2S_2" ]; then 
       echo "$page: same"; 
  else 
       echo "$page: different"; 
       # if the checksums are different, create a difference PNG image
       compare "$g1" "$g2" difference_"$page".png 
  fi
done

# clean up
rm -f "${file_1%.*}"---page_* "${file_2%.*}"---page_* doc_data.txt

Finally, here is AreIdentical.sh:

#!/bin/bash
file_1="$1"
file_2="$2"
B2S_1=$(b2sum "$file_1" | awk '{print $1}')
B2S_2=$(b2sum "$file_2" | awk '{print $1}')
if [ "$B2S_1" = "$B2S_2" ]; then echo "same"; else echo "different"; fi


Here is a finished script, "cmppdf", based on linguisticturn's code plus support for comparing the text in the PDFs and some polishing:

https://abhweb.org/jima/cmppdf

Documentation:

 NAME
    cmppdf -- Compare the visual appearance or text of PDF files

 SYNOPSIS
    cmppdf        [-o BASEPATH] [-q] [-d] FILE1 FILE2
    cmppdf --text [-o BASEPATH] [-q] [-d] FILE1 FILE2

 EXIT STATUS
   0  if no differences found
   1  if differences found
   2+ if trouble

 OPTIONS
   -t, --text      Compare the text in the PDFs, ignoring graphical appearance.

   -o, --output BASEPATH

     With this option a "difference file" named BASEPATH_page_NNN.png
     or .txt is created for each page which has differences.
     With visual comparison (the default), the files will be .png images with
     changed parts highlighted in RED.   With text comparison (--text option),
     the files will contain output from the 'diff' command, or if BASEPATH
     is '-' then all diffs are written to stdout.

   --diff diff-option1,diff-option2, ...

     Specify options to pass to the 'diff' command, separated by commas.
     The default is '-u'.  --text is implied by --diff.

   -q, --quiet     Suppress all progress messages

   -d, --debug     Show detailed information about commands run
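
The documented exit status makes cmppdf easy to use from other scripts. The helper below is a small sketch that maps the status to a message; the function name report_status is hypothetical, and the meaning of the codes is taken from the EXIT STATUS section above.

```shell
# Sketch: turn the documented cmppdf exit status into a message.
# 0 = no differences, 1 = differences found, 2 or more = trouble.
report_status() {
    case "$1" in
        0) echo "no differences" ;;
        1) echo "differences found" ;;
        *) echo "trouble (exit status $1)" ;;
    esac
}

# Usage (assumes cmppdf is on the PATH):
#   cmppdf -q old.pdf new.pdf; report_status $?
```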

@linguisticturn: Please contact me at the email given in the script so I can give you proper credit!
