Comparison of two pdf files

2023-03-20 23:11 问答作者：

I need to compare the contents of two almost similar files and highlight the dissimilar portions in the corresponding pdf file. Am using pdfbox. Please help me atleast with the开发者_如何学C logic.

If you prefer a tool with a GUI, you could try this one: diffpdf. It's by Mark Summerfield, and since it's written with Qt, it should be available (or should be buildable) on all platforms where Qt runs on.

Here's a screenshot:

You can do the same thing with a shell script on Linux. The script wraps 3 components:

ImageMagick's compare command
the pdftk utility
Ghostscript

It's rather easy to translate this into a .bat Batch file for DOS/Windows...

Here are the building blocks:

pdftk

Use this command to split multipage PDF files into multiple singlepage PDFs:

pdftk  first.pdf  burst  output  somewhere/firstpdf_page_%03d.pdf
pdftk  2nd.pdf    burst  output  somewhere/2ndpdf_page_%03d.pdf

compare

Use this command to create a "diff" PDF page for each of the pages:

compare \
       -verbose \
       -debug coder -log "%u %m:%l %e" \
        somewhere/firstpdf_page_001.pdf \
        somewhere/2ndpdf_page_001.pdf \
       -compose src \
        somewhereelse/diff_page_001.pdf

Note, that compare is part of ImageMagick. But for PDF processing it needs Ghostscript as a 'delegate', because it cannot do so natively itself.

Once more, pdftk

Now you can again concatenate your "diff" PDF pages with pdftk:

pdftk \
      somewhereelse/diff_page_*.pdf \
      cat \
      output somewhereelse/diff_allpages.pdf

Ghostscript

Ghostscript automatically inserts meta data (such as the current date+time) into its PDF output. Therefore this is not working well for MD5hash-based file comparisons.

If you want to automatically discover all cases which consist of purely white pages (that means: there are no visible differences in your input pages), you could also convert to a meta-data free bitmap format using the bmp256 output device. You can do that for the original PDFs (first.pdf and 2nd.pdf), or for the diff-PDF pages:

 gs \
   -o diff_page_001.bmp \
   -r72 \
   -g595x842 \
   -sDEVICE=bmp256 \
    diff_page_001.pdf

 md5sum diff_page_001.bmp

Just create an all-white BMP page with its MD5sum (for reference) like this:

 gs \
   -o reference-white-page.bmp \
   -r72 \
   -g595x842 \
   -sDEVICE=bmp256 \
   -c "showpage quit"

 md5sum reference-white-page.bmp

I had this very problem myself and the quickest way that I've found is to use PHP and its bindings for ImageMagick (Imagick).

<?php
$im1 = new \Imagick("file1.pdf");
$im2 = new \Imagick("file2.pdf");

$result = $im1->compareImages($im2, \Imagick::METRIC_MEANSQUAREERROR);

if($result[1] > 0.0){
    // Files are DIFFERENT
}
else{
    // Files are IDENTICAL
}

$im1->destroy();
$im2->destroy();

Of course, you need to install the ImageMagick bindings first:

sudo apt-get install php5-imagick # Ubuntu/Debian

I have come up with a jar using apache pdfbox to compare pdf files - this can compare pixel by pixel & highlight the differences.

Check my blog : http://www.testautomationguru.com/introducing-pdfutil-to-compare-pdf-files-extract-resources/ for example & download.

To get page count

import com.taguru.utility.PDFUtil;

PDFUtil pdfUtil = new PDFUtil();
pdfUtil.getPageCount("c:/sample.pdf"); //returns the page count

To get page content as plain text

//returns the pdf content - all pages
pdfUtil.getText("c:/sample.pdf");

// returns the pdf content from page number 2
pdfUtil.getText("c:/sample.pdf",2);

// returns the pdf content from page number 5 to 8
pdfUtil.getText("c:/sample.pdf", 5, 8);

To extract attached images from PDF

//set the path where we need to store the images
 pdfUtil.setImageDestinationPath("c:/imgpath");
 pdfUtil.extractImages("c:/sample.pdf");

// extracts & saves the pdf content from page number 3
pdfUtil.extractImages("c:/sample.pdf", 3);

// extracts & saves the pdf content from page 2
pdfUtil.extractImages("c:/sample.pdf", 2, 2);

To store PDF pages as images

//set the path where we need to store the images
 pdfUtil.setImageDestinationPath("c:/imgpath");
 pdfUtil.savePdfAsImage("c:/sample.pdf");

To compare PDF files in text mode (faster – But it does not compare the format, images etc in the PDF)

String file1="c:/files/doc1.pdf";
String file1="c:/files/doc2.pdf";

// compares the pdf documents & returns a boolean
// true if both files have same content. false otherwise.
pdfUtil.comparePdfFilesTextMode(file1, file2);

// compare the 3rd page alone
pdfUtil.comparePdfFilesTextMode(file1, file2, 3, 3);

// compare the pages from 1 to 5
pdfUtil.comparePdfFilesTextMode(file1, file2, 1, 5);

To compare PDF files in Binary mode (slower – compares PDF documents pixel by pixel – highlights pdf difference & store the result as image)

String file1="c:/files/doc1.pdf";
String file1="c:/files/doc2.pdf";

// compares the pdf documents & returns a boolean
// true if both files have same content. false otherwise.
pdfUtil.comparePdfFilesBinaryMode(file1, file2);

// compare the 3rd page alone
pdfUtil.comparePdfFilesBinaryMode(file1, file2, 3, 3);

// compare the pages from 1 to 5
pdfUtil.comparePdfFilesBinaryMode(file1, file2, 1, 5);

//if you need to store the result
pdfUtil.highlightPdfDifference(true);
pdfUtil.setImageDestinationPath("c:/imgpath");
pdfUtil.comparePdfFilesBinaryMode(file1, file2);

To compare PDFs on macOS Monterey (i.e. version 12), I was able to install diff-pdf using homebrew, and run it.

The --view option didn't work for me, but the --output-diff did.

继续阅读：comparison pdf pdfbox

Comparison of two pdf files

pdftk

compare

Once more, pdftk

Ghostscript

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

pdftk

compare

Once more, pdftk

Ghostscript

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？