
How do I modify PDF without a library using C# and stream it back to client in ASP.NET?

I'm having an issue where I'm corrupting a PDF and I'm not sure of a proper solution. I've seen several posts from people trying to just do a basic stream, or trying to modify the file with a third-party library. Here is how my situation differs...

I have all the web pieces in place to get me the PDF streamed back and it works fine until I try to modify it with C#.

  1. I've modified the PDF in a text editor manually to remove the <> entries and tested that the PDF functions properly after that.

  2. I've then programmatically streamed the PDF in as a byte[] from the database, converted it to a string, and used a Regex to find and remove the same entries I removed manually.

  3. THE PROBLEM! When I try to convert the modified PDF string contents back into a byte[] to stream back, the PDF encoding no longer seems to be correct. What is the correct encoding?

Does anyone know the best way to do something like this? I'm trying to keep my solution as light as possible because our site is geared towards PDF document access, so heavy or complex APIs are not preferable unless there are no other options. Also, because this situation only arises when our users view the file in an iframe for "preview", I can't permanently modify the PDF.

Thanks for your help in advance!


Try using the following BinaryEncoding class as the encoding. It simply casts each byte to a char (and back), so only ASCII data is correctly processed as a string, but the rest of the data is kept unchanged and nothing is lost, as long as you don't write any Unicode characters above 0x00FF back into the string. For your roundtrip it should work just fine.

using System;
using System.Text;

// Maps each byte 0x00-0xFF to the char with the same value (and back),
// so a byte[] -> string -> byte[] roundtrip is lossless.
public class BinaryEncoding : Encoding {
    private static readonly BinaryEncoding @default = new BinaryEncoding();

    public static new BinaryEncoding Default {
        get {
            return @default;
        }
    }

    public override int GetByteCount(char[] chars, int index, int count) {
        if (chars == null) {
            throw new ArgumentNullException("chars");
        }
        return count;
    }

    public override int GetBytes(char[] chars, int charIndex, int charCount, byte[] bytes, int byteIndex) {
        if (chars == null) {
            throw new ArgumentNullException("chars");
        }
        if (bytes == null) {
            throw new ArgumentNullException("bytes");
        }
        if (charCount < 0) {
            throw new ArgumentOutOfRangeException("charCount");
        }
        unchecked {
            for (int i = 0; i < charCount; i++) {
                bytes[byteIndex+i] = (byte)chars[charIndex+i];
            }
        }
        return charCount;
    }

    public override int GetCharCount(byte[] bytes, int index, int count) {
        if (bytes == null) {
            throw new ArgumentNullException("bytes");
        }
        return count;
    }

    public override int GetChars(byte[] bytes, int byteIndex, int byteCount, char[] chars, int charIndex) {
        if (bytes == null) {
            throw new ArgumentNullException("bytes");
        }
        if (chars == null) {
            throw new ArgumentNullException("chars");
        }
        if (byteCount < 0) {
            throw new ArgumentOutOfRangeException("byteCount");
        }
        unchecked {
            for (int i = 0; i < byteCount; i++) {
                chars[charIndex+i] = (char)bytes[byteIndex+i];
            }
        }
        return byteCount;
    }

    public override int GetMaxByteCount(int charCount) {
        return charCount;
    }

    public override int GetMaxCharCount(int byteCount) {
        return byteCount;
    }
}
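If you'd rather not define a custom Encoding, the built-in Latin-1 codec ("ISO-8859-1") gives the same lossless one-byte-per-char roundtrip, since it maps bytes 0x00-0xFF to chars 0x0000-0x00FF one-to-one. A minimal sketch of the roundtrip, using stand-in bytes and a hypothetical pattern rather than your actual database content and Regex:

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

class Program {
    static void Main() {
        // Stand-in for the bytes loaded from the database: "%PDF<<A".
        byte[] pdfBytes = { 0x25, 0x50, 0x44, 0x46, 0x3C, 0x3C, 0x41 };

        // Latin-1 decodes every byte to the char with the same value,
        // so nothing is lost (unlike UTF-8 or Encoding.Default, which
        // mangle bytes >= 0x80).
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");

        string text = latin1.GetString(pdfBytes);
        string modified = Regex.Replace(text, "<<", "");  // hypothetical edit
        byte[] outBytes = latin1.GetBytes(modified);

        Console.WriteLine(latin1.GetString(outBytes)); // %PDFA
    }
}
```

Decode, edit, and re-encode must all use the same single-byte encoding; mixing in UTF-8 at any step is what corrupts the binary portions of the file.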


You seem to be discovering that...

the PDF format is not trivial!

While it may be OK (if kludgey) to patch a few "text" bytes in situ (i.e. keeping size and structure unchanged), "messing" much more than that with PDF files typically ends up breaking them. Regular expressions certainly seem like a blunt tool for the job.

The PDF file needs to be parsed and seen as a hierarchical collection of objects (and then some...), and that's why we need the libraries which encapsulate knowledge of the format.

If you need convincing, you may peruse the now-ISO-standard specification for the PDF format (version 1.7), available for free on Adobe's web site. By the way, those 750 pages cover only the latest version; while there's much overlap, previous versions introduce yet more details to contend with...

Edit:
That said, on re-reading the question and Lucero's remark, the changes indicated do seem small/safe enough that a "snip and tuck" approach may work.
Beware that this type of approach may lead to issues over time (when the format encountered is a different version, older or newer!, or when the file content somehow exposes different structures) or with some specific uses (for example, it may prevent users from using features of the PDF documents such as forms or security). A compromise may be to learn enough about the format(s) at hand to confirm that the changes are indeed benign.
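One way to make a "snip and tuck" edit safer, keeping size and structure unchanged as described above, is to overwrite the matched bytes with spaces of equal length instead of deleting them: the file size stays the same, so the byte offsets recorded in the PDF's xref table remain valid. A sketch, using a hypothetical /Foo dictionary entry rather than the poster's actual target:

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

class Program {
    static void Main() {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");

        // Stand-in for a PDF fragment; /Foo is a hypothetical entry.
        byte[] pdfBytes = latin1.GetBytes("%PDF-1.4 <</Foo 1>> trailer");

        string text = latin1.GetString(pdfBytes);

        // Replace the match with spaces of the same length: nothing
        // shifts, so xref byte offsets elsewhere in the file still point
        // at the right objects.
        string patched = Regex.Replace(text, @"/Foo 1",
            m => new string(' ', m.Length));

        byte[] outBytes = latin1.GetBytes(patched);
        Console.WriteLine(outBytes.Length == pdfBytes.Length); // True
    }
}
```

This only works when whitespace is an acceptable substitute at that spot in the syntax, which is one more thing to confirm against the specification.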

Also... while the PDF format is a relatively complicated affair, the libraries that deal with it are not necessarily heavy, and they are typically easy to use.

In short, you'll need to weigh the benefits and drawbacks of both approaches and pick accordingly ;-) (how was that for a "non-answer"?).


Look into iText. There is a reason why things like the Apache Commons libraries exist.

