
Extracting text and retaining formatting

Is there a way to extract text from a PDF document with the iTextSharp library while retaining formatting, e.g. the newline and tab characters?


When extracting text, the tab characters will come through, assuming that they actually are tab characters in the PDF. I don't believe that newline characters can be determined without manually keeping track of the current text coordinates. You might be able to count the number of Td tokens between BT and ET and subtract 1, but that's just a guess.

EDIT

Never mind the token idea. I thought Td was used only for line readjustment (new line), but I was wrong.


I suggest you write your own TextExtractionStrategy based on LocationTextExtractionStrategy.

You'll need to track where the baselines are to determine newlines.

Actually, LocationTextExtractionStrategy might just add the newlines for you. Either way, that's where you need to start.
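To illustrate the baseline-tracking idea, here is a rough, library-independent sketch in Python rather than iTextSharp code. It assumes you already have text chunks with the start coordinates of their baselines; the `(text, x, y)` tuple shape, the function name `chunks_to_text`, and the `y_tolerance` default are all made up for the example and are not part of any iText API.

```python
from typing import List, Tuple

# Hypothetical chunk shape: (text, x, y), where (x, y) is the start of the
# chunk's baseline in PDF user space (y grows upward on the page).
Chunk = Tuple[str, float, float]


def chunks_to_text(chunks: List[Chunk], y_tolerance: float = 2.0) -> str:
    """Join positioned chunks, starting a new line when the baseline moves."""
    lines = []  # each entry is [baseline_y, [(x, text), ...]]
    # Visit chunks from the top of the page to the bottom (descending y).
    for text, x, y in sorted(chunks, key=lambda c: -c[2]):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            # Close enough to the current baseline: same line.
            lines[-1][1].append((x, text))
        else:
            # Baseline moved by more than the tolerance: new line.
            lines.append([y, [(x, text)]])
    # Within each line, order chunks left to right before joining.
    return "\r\n".join(
        "".join(text for _x, text in sorted(parts, key=lambda p: p[0]))
        for _y, parts in lines
    )


sample = [
    ("world", 60.0, 700.0),
    ("Hello ", 10.0, 700.5),
    ("Second line", 10.0, 686.0),
]
print(repr(chunks_to_text(sample)))
```

The tolerance is there because chunks on the same visual line do not always report exactly the same y value. A custom strategy would need the same kind of comparison wherever it receives the chunk positions.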


It turns out the "\r\n" formatting is indeed retained. I verified this by fetching the value from the SQL Server table programmatically and calling Console.WriteLine(). Initially I was copying the value directly from SQL Server Management Studio and pasting it into a text file, which isn't a reliable way to verify.
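For anyone who wants to check this kind of thing without relying on copy and paste, a small sketch along these lines makes the control characters visible. It is in Python rather than C#, and the helper name `show_control_chars` and the sample string are invented for the example.

```python
def show_control_chars(text: str) -> str:
    """Return text with CR, LF, and tab shown as visible escape sequences."""
    return (
        text.replace("\r", "\\r")
        .replace("\n", "\\n")
        .replace("\t", "\\t")
    )


# Hypothetical extracted value, standing in for what was read from the table.
extracted = "Name:\tJohn\r\nCity:\tOslo"
print(show_control_chars(extracted))  # prints: Name:\tJohn\r\nCity:\tOslo
```

If the escapes show up in the output, the characters were stored correctly and any loss happened in the tool used to view or copy the value.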
