What is the best way to compare two PDF files in C#?

2023-01-29 01:43

I want to compare the text content of two PDF files in C#.

If you only need to know whether the two files are identical, you can do a binary comparison. For a comparison of the content, you will probably need a PDF library.
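A binary comparison could look roughly like the sketch below. The class and method names (`PdfBinaryCompare`, `FilesAreIdentical`) are made up for illustration, and the code uses only `System.IO`, so nothing in it is PDF-specific.

```csharp
using System.IO;

public static class PdfBinaryCompare
{
    // Hypothetical helper: returns true when both files exist and contain exactly the same bytes.
    public static bool FilesAreIdentical(string path1, string path2)
    {
        if (!File.Exists(path1) || !File.Exists(path2))
            return false;

        // Files of different length can never be identical, so skip reading them.
        if (new FileInfo(path1).Length != new FileInfo(path2).Length)
            return false;

        using (FileStream s1 = File.OpenRead(path1))
        using (FileStream s2 = File.OpenRead(path2))
        {
            int b1, b2;
            do
            {
                // ReadByte returns -1 at end of stream; FileStream buffers internally.
                b1 = s1.ReadByte();
                b2 = s2.ReadByte();
                if (b1 != b2)
                    return false;
            } while (b1 != -1);
        }
        return true;
    }
}
```

Keep in mind that two PDFs that look the same on screen can still differ at the byte level (embedded IDs or timestamps, for example), so a `false` result here does not necessarily mean the text differs.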

This is not going to be easy, but I guess the first step would be to get a decent PDF library that can extract the text from PDFs. One I've used is iTextSharp, available from http://itextpdf.com/ (open source). Then run the extracted text through a diff library, such as DIffer: a reusable C# diffing utility and class library. Good luck!
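To give an idea of the diff step, here is a minimal sketch of a line-based diff built on a longest-common-subsequence table. It assumes the text of both PDFs has already been extracted into strings by whatever library you choose. `TextDiff` and `DiffLines` are hypothetical names for illustration only; this is not the API of iTextSharp or of any particular diff library.

```csharp
using System;
using System.Collections.Generic;

public static class TextDiff
{
    // Hypothetical helper: returns one entry per line, prefixed with
    // "  " (unchanged), "- " (only in oldText) or "+ " (only in newText).
    public static List<string> DiffLines(string oldText, string newText)
    {
        // Normalize Windows line endings so "\r" does not cause false differences.
        string[] a = oldText.Replace("\r\n", "\n").Split('\n');
        string[] b = newText.Replace("\r\n", "\n").Split('\n');

        // lcs[i, j] = length of the longest common subsequence of a[i..] and b[j..].
        int[,] lcs = new int[a.Length + 1, b.Length + 1];
        for (int i = a.Length - 1; i >= 0; i--)
            for (int j = b.Length - 1; j >= 0; j--)
                lcs[i, j] = a[i] == b[j]
                    ? lcs[i + 1, j + 1] + 1
                    : Math.Max(lcs[i + 1, j], lcs[i, j + 1]);

        // Walk the table from the start, emitting matches, deletions and insertions.
        List<string> result = new List<string>();
        int x = 0, y = 0;
        while (x < a.Length && y < b.Length)
        {
            if (a[x] == b[y]) { result.Add("  " + a[x]); x++; y++; }
            else if (lcs[x + 1, y] >= lcs[x, y + 1]) { result.Add("- " + a[x]); x++; }
            else { result.Add("+ " + b[y]); y++; }
        }
        while (x < a.Length) result.Add("- " + a[x++]);
        while (y < b.Length) result.Add("+ " + b[y++]);
        return result;
    }
}
```

The table needs memory proportional to the product of the two line counts, which is fine for typical documents but may be too much for very large ones; a dedicated diff library is likely the better choice there.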

It's been a while, but this function worked for me (no guarantees, though: I don't remember whether I tried it on PDFs with embedded images or anything). There is a GUID or some sort of ID embedded in the file; you just need to remove that and compare everything else. Here's the code:

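    using System;
    using System.IO;
    using System.Text;

    // Compares two PDF files character by character, ignoring the trailer /ID entry.
    static bool ComparePDFs(string file1, string file2)
    {
        if (!File.Exists(file1) || !File.Exists(file2))
            return false;

        // ISO-8859-1 maps every byte to exactly one char, so binary data is not mangled.
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        string f1 = File.ReadAllText(file1, latin1);
        string f2 = File.ReadAllText(file2, latin1);

        if (f1.Length != f2.Length)
            return false;

        return RemovePdfId(f1, file1) == RemovePdfId(f2, file2);
    }

    // Removes 75 characters starting at the last "/ID [<" (assumes a fixed-length ID entry).
    static string RemovePdfId(string content, string fileName)
    {
        int i = content.LastIndexOf("/ID [<", StringComparison.Ordinal);
        if (i < 0)
        {
            Console.WriteLine("Warning: no /ID entry found in file: " + fileName);
            return content;
        }

        // Clamp the end index so a short or truncated file does not throw.
        return content.Substring(0, i) + content.Substring(Math.Min(i + 75, content.Length));
    }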

Disclaimer: I work for Atalasoft.

Atalasoft's DotImage SDK can be used to extract the text from PDFs in C#. If the PDFs are already searchable you can easily get to the text:

    public String GetText(Stream s, int pageNum, int charIndex, int count)
    {
        using (PdfTextDocument doc = new PdfTextDocument(s))
        {
            PdfTextPage textPage = doc.GetPage(pageNum);
            return textPage.GetText(charIndex, count);
        }
    }

Otherwise, you could use the OCR tools to detect the text in the page images.
